History log of /linux-master/mm/gup.c
Revision Date Author Comments
# 631426ba 14-Mar-2024 David Hildenbrand <david@redhat.com>

mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.

A reproducer can be found at [1].

The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().

__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.

So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().

But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.

As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.

With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.

[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/

Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1096bc93 29-Mar-2024 Linus Torvalds <torvalds@linux-foundation.org>

mm: clean up populate_vma_page_range() FOLL_* flag handling

The code wasn't exactly wrong, but it was very odd, and it used
FOLL_FORCE together with FOLL_WRITE when it really didn't need to (it
only set FOLL_WRITE for writable mappings, so then the FOLL_FORCE was
pointless).

It also pointlessly called __get_user_pages() even when it knew it
wouldn't populate anything because the vma wasn't accessible and it
explicitly tested for and did *not* set FOLL_FORCE for inaccessible
vma's.

This code does need to use FOLL_FORCE, because we want to do fault in
writable shared mappings, but then the mapping may not actually be
readable. And we don't want to use FOLL_WRITE (which would match the
permission of the vma), because that would also dirty the pages, which
we don't want to do.

For very similar reasons, FOLL_FORCE populates a executable-only mapping
with no read permissions. We don't have a FOLL_EXEC flag.

Yes, it would probably be cleaner to split FOLL_WRITE into two bits (for
separate permission and dirty bit handling), and add a FOLL_EXEC flag
for the "GUP executable page" case. That would allow us to avoid
FOLL_FORCE entirely here.

But that's not how our FOLL_xyz bits have traditionally worked, and that
would be a much bigger patch.

So this at least avoids the FOLL_FORCE | FOLL_WRITE combination that
made one of my experimental validation patches trigger a warning. That
warning was a false positive (and my experimental patch was incomplete
anyway), but it all made me look at this and decide to clean at least
this small case up.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e3b4b137 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

Let's convert it like we converted all the other rmap functions. Don't
introduce folio_try_share_anon_rmap_ptes() for now, as we don't have a
user that wants rmap batching in sight. Pretty easy to add later.

All users are easy to convert -- only ksm.c doesn't use folios yet but
that is left for future work -- so let's just do it in a single shot.

While at it, turn the BUG_ON into a WARN_ON_ONCE.

Note that page_try_share_anon_rmap() so far didn't care about pte/pmd
mappings (no compound parameter). We're changing that so we can perform
better sanity checks and make the code actually more readable/consistent.
For example, __folio_rmap_sanity_checks() will make sure that a PMD range
actually falls completely into the folio.

Link: https://lkml.kernel.org/r/20231220224504.646757-39-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e9119fb6 23-Nov-2023 Peter Xu <peterx@redhat.com>

mm/gup: fix follow_devmap_p[mu]d() on page==NULL handling

This is a bug found not by any report but only by code observations.

When GUP sees a devpmd/devpud and if page==NULL is returned, it means a
fault is probably required. Here falling through when page==NULL can
cause unexpected behavior.

Fix both cases by catching the page==NULL cases with no_page_table().

Link: https://lkml.kernel.org/r/20231123180222.1048297-1-peterx@redhat.com
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Fixes: 080dbb618b4b ("mm/follow_page_mask: split follow_page_mask to smaller functions.")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9c4b2142 02-Oct-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: make failure to pin an error if FOLL_NOWAIT not specified

There really should be no circumstances under which a non-FOLL_NOWAIT GUP
operation fails to return any pages, so make this an error and warn on it.

To catch the trivial case, simply exit early if nr_pages == 0.

This brings __get_user_pages_locked() in line with the behaviour of its
nommu variant.

Link: https://lkml.kernel.org/r/2a42d96dd1e37163f90a0019a541163dafb7e4c3.1696288092.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0f20bba1 02-Oct-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: explicitly define and check internal GUP flags, disallow FOLL_TOUCH

Rather than open-coding a list of internal GUP flags in
is_valid_gup_args(), define which ones are internal.

In addition, explicitly check to see if the user passed in FOLL_TOUCH
somehow, as this appears to have been accidentally excluded.

Link: https://lkml.kernel.org/r/971e013dfe20915612ea8b704e801d7aef9a66b6.1696288092.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6beb9958 12-Jun-2023 Rick Edgecombe <rick.p.edgecombe@intel.com>

mm: Don't allow write GUPs to shadow stack memory

The x86 Control-flow Enforcement Technology (CET) feature includes a
new type of memory called shadow stack. This shadow stack memory has
some unusual properties, which requires some core mm changes to
function properly.

In userspace, shadow stack memory is writable only in very specific,
controlled ways. However, since userspace can, even in the limited
ways, modify shadow stack contents, the kernel treats it as writable
memory. As a result, without additional work there would remain many
ways for userspace to trigger the kernel to write arbitrary data to
shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To
help userspace protect their shadow stacks, make this a little less
exposed by blocking writable get_user_pages() operations for shadow
stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections. This is required for debugging use
cases.

[ dhansen: fix rebase goof, readd writable_file_mapping_allowed() hunk ]

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-23-rick.p.edgecombe%40intel.com


# 8f9ff2de 22-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

secretmem: convert page_is_secretmem() to folio_is_secretmem()

The only caller already has a folio, so use it to save calling
compound_head() in PageLRU() and remove a use of page->mapping.

Link: https://lkml.kernel.org/r/20230822202335.179081-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7acddcc1 03-Aug-2023 David Hildenbrand <david@redhat.com>

mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT

Commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
gup/gup_fast") from 2012 documented as the primary reason why we would want
to handle NUMA hinting faults from GUP:

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

That is still the case today, and relevant KVM code has been converted to
manually set FOLL_HONOR_NUMA_FAULT. So let's stop setting
FOLL_HONOR_NUMA_FAULT for all GUP users and cross fingers that not that
many other ones that really require such handling for autonuma remain.

Possible interaction with MMU notifiers:

Assume a driver obtains a page using get_user_pages() to map it into
a secondary MMU, and uses the MMU notifier framework to get notified on
changes.

Assume get_user_pages() succeeded on a PROT_NONE-mapped page (because
FOLL_HONOR_NUMA_FAULT is not set) in an accessible VMA and the page is
mapped into a secondary MMU. Once user space would turn that mapping
inaccessible using mprotect(PROT_NONE), the actual PTE in the page table
might not change. If the MMU notifier would be smart and optimize for that
case "why notify if the PTE didn't change", that could be problematic.

At least change_pmd_range() with MMU_NOTIFY_PROTECTION_VMA for now does an
unconditional mmu_notifier_invalidate_range_start() ->
mmu_notifier_invalidate_range_end() and should be fine.

Note that even if a PTE in an accessible VMA is pte_protnone(), the
underlying page might be accessed by a secondary MMU that does not set
FOLL_HONOR_NUMA_FAULT, and test_young() MMU notifiers would return "true".

Link: https://lkml.kernel.org/r/20230803143208.383663-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 48498071 28-Jun-2023 Peter Xu <peterx@redhat.com>

mm/gup: retire follow_hugetlb_page()

Now __get_user_pages() should be well prepared to handle thp completely,
as long as hugetlb gup requests even without the hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal.

Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 57edfcfd 28-Jun-2023 Peter Xu <peterx@redhat.com>

mm/gup: accelerate thp gup even for "pages != NULL"

The acceleration of THP was done with ctx.page_mask, however it'll be
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages"). It didn't explain why
we can't optimize the **pages non-NULL case. It's possible that at that
time the major goal was for mm_populate() which should be enough back
then.

Optimize thp for all cases, by properly looping over each subpage, doing
cache flushes, and boost refcounts / pincounts where needed in one go.

This can be verified using gup_test below:

# chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before: 13992.50 ( +-8.75%)
After: 378.50 (+-69.62%)

Link: https://lkml.kernel.org/r/20230628215310.73782-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ffe1e786 28-Jun-2023 Peter Xu <peterx@redhat.com>

mm/gup: cleanup next_page handling

The only path that doesn't use generic "**pages" handling is the gate vma.
Make it use the same path, meanwhile tune the next_page label upper to
cover "**pages" handling. This prepares for THP handling for "**pages".

Link: https://lkml.kernel.org/r/20230628215310.73782-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5502ea44 28-Jun-2023 Peter Xu <peterx@redhat.com>

mm/hugetlb: add page_mask for hugetlb_follow_page_mask()

follow_page() doesn't need it, but we'll start to need it when unifying
gup for hugetlb.

Link: https://lkml.kernel.org/r/20230628215310.73782-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dd767aaa 28-Jun-2023 Peter Xu <peterx@redhat.com>

mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()

Patch series "mm/gup: Unify hugetlb, speed up thp", v4.

Hugetlb has a special path for slow gup that follow_page_mask() is
actually skipped completely along with faultin_page(). It's not only
confusing, but also duplicating a lot of logics that generic gup already
has, making hugetlb slightly special.

This patchset tries to dedup the logic, by first touching up the slow gup
code to be able to handle hugetlb pages correctly with the current follow
page and faultin routines (where we're mostly there.. due to 10 years ago
we did try to optimize thp, but half way done; more below), then at the
last patch drop the special path, then the hugetlb gup will always go the
generic routine too via faultin_page().

Note that hugetlb is still special for gup, mostly due to the pgtable
walking (hugetlb_walk()) that we rely on which is currently per-arch. But
this is still one small step forward, and the diffstat might be a proof
too that this might be worthwhile.

Then for the "speed up thp" side: as a side effect, when I'm looking at
the chunk of code, I found that thp support is actually partially done.
It doesn't mean that thp won't work for gup, but as long as **pages
pointer passed over, the optimization will be skipped too. Patch 6 should
address that, so for thp we now get full speed gup.

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10"
gives me 13992.50us -> 378.50us. Gup_test is an extreme case, but just to
show how it affects thp gups.


This patch (of 8):

Firstly, the no_page_table() is meaningless for hugetlb which is a no-op
there, because a hugetlb page always satisfies:

- vma_is_anonymous() == false
- vma->vm_ops->fault != NULL

So we can already safely remove it in hugetlb_follow_page_mask(), alongside
with the page* variable.

Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
dump: we try to fault in the page only if the page cache is already
allocated. Let's do the same here for follow_page_mask() on hugetlb.

It should so far has zero effect on real dumps, because that still goes
into follow_hugetlb_page(). But this may start to influence a bit on
follow_page() users who mimics a "dump page" scenario, but hopefully in a
good way. This also paves way for unifying the hugetlb gup-slow.

Link: https://lkml.kernel.org/r/20230628215310.73782-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230628215310.73782-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d74943a2 03-Aug-2023 David Hildenbrand <david@redhat.com>

mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT

Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.

As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they use
follow_page instead of get_user_pages."

liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE. Further, it's easy to
reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs,
and follow_page() refuses to return such pages right now.

As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.

To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.

Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.

Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.

1) get_dump_page(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
hinting faults.

To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().

As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fallback to
ordinary GUP when stumbling over pte_protnone()/pmd_protnone().

As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.

So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
adding appropriate documenation.

While at it, remove a stale comment from follow_trans_huge_pmd(): That
comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa:
serialise parallel get_user_page against THP migration"), which noted:

THP does not unmap pages due to a lack of support for migration
entries at a PMD level. This allows races with get_user_pages

Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.

[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com

Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: liubo <liubo254@huawei.com>
Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
Reported-by: Peter Xu <peterx@redhat.com>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6cd06ab1 05-Jul-2023 Linus Torvalds <torvalds@linux-foundation.org>

gup: make the stack expansion warning a bit more targeted

I added a warning about about GUP no longer expanding the stack in
commit a425ac5365f6 ("gup: add warning if some caller would seem to want
stack expansion"), but didn't really expect anybody to hit it.

And it's true that nobody seems to have hit a _real_ case yet, but we
certainly have a number of reports of false positives. Which not only
causes extra noise in itself, but might also end up hiding any real
cases if they do exist.

So let's tighten up the warning condition, and replace the simplistic

vma = find_vma(mm, start);
if (vma && (start < vma->vm_start)) {
WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);

with a

vma = gup_vma_lookup(mm, start);

helper function which works otherwise like just "vma_lookup()", but with
some heuristics for when to warn about gup no longer causing stack
expansion.

In particular, don't just warn for "below the stack", but warn if it's
_just_ below the stack (with "just below" arbitrarily defined as 64kB,
because why not?). And rate-limit it to at most once per hour, which
means that any false positives shouldn't completely hide subsequent
reports, but we won't be flooding the logs about it either.

The previous code triggered when some GUP user (chromium crashpad)
accessing past the end of the previous vma, for example. That has never
expanded the stack, it just causes GUP to return early, and as such we
shouldn't be warning about it.

This is still going trigger the randomized testers, but to mitigate the
noise from that, use "dump_stack()" instead of "WARN_ON_ONCE()" to get
the kernel call chain. We'll get the relevant information, but syzbot
shouldn't get too upset about it.

Also, don't even bother with the GROWSUP case, which would be using
different heuristics entirely, but only happens on parisc.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: syzbot+6cf44e127903fdf9d929@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a425ac53 25-Jun-2023 Linus Torvalds <torvalds@linux-foundation.org>

gup: add warning if some caller would seem to want stack expansion

It feels very unlikely that anybody would want to do a GUP in an
unmapped area under the stack pointer, but real users sometimes do some
really strange things. So add a (temporary) warning for the case where
a GUP fails and expanding the stack might have made it work.

It's trivial to do the expansion in the caller as part of getting the mm
lock in the first place - see __access_remote_vm() for ptrace, for
example - it's just that it's unnecessarily painful to do it deep in the
guts of the GUP lookup when we might have to drop and re-take the lock.

I doubt anybody actually does anything quite this strange, but let's be
proactive: adding these warnings is simple, and will make debugging it
much easier if they trigger.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d7071af 24-Jun-2023 Linus Torvalds <torvalds@linux-foundation.org>

mm: always expand the stack with the mmap write lock held

This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9883c7f8 19-Jun-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: do not return 0 from pin_user_pages_fast() for bad args

These routines are not intended to return zero, the callers cannot do
anything sane with a 0 return. They should return an error which means
future calls to GUP will not succeed, or they should return some non-zero
number of pinned pages which means GUP should be called again.

If start + nr_pages overflows it should return -EOVERFLOW to signal the
arguments are invalid.

Syzkaller keeps tripping on this when fuzzing GUP arguments.

Link: https://lkml.kernel.org/r/0-v1-3d5ed1f20d50+104-gup_overflow_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reported-by: syzbot+353c7be4964c6253f24a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000094fdd05faa4d3a4@google.com
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 503670ee 13-Jun-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

mm/gup.c: reorganize try_get_folio()

try_get_folio() takes in a page, then chooses to do some folio operations
based on the flags (either FOLL_GET or FOLL_PIN). We can rewrite this
function to be more purpose oriented.

After calling try_get_folio(), if neither FOLL_GET nor FOLL_PIN are set,
warn and fail. If FOLL_GET is set we can return the result. If FOLL_GET
is not set then FOLL_PIN is set, so we pin the folio.

This change assists with folio conversions, and makes the function more
readable.

Link: https://lkml.kernel.org/r/20230614021312.34085-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c33c7948 12-Jun-2023 Ryan Roberts <ryan.roberts@arm.com>

mm: ptep_get() conversion

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2378118b 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
distinguish huge_zero_page from a normal THP: follow_page_pte() handles
any instability, and here it's a good idea to replace any pmd_none(*pmd)
by a page table a.s.a.p, in the huge_zero_page case as for a normal THP;
and this removes an unnecessary possibility of -EBUSY failure.

(Hmm, couldn't the normal THP case have hit an unstably refaulted THP
before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)

Link: https://lkml.kernel.org/r/59fd15dd-4d39-5ec-2043-1d5117f7f85@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 04dee9e8 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm/various: give up if pte_offset_map[_lock]() fails

Following the examples of nearby code, various functions can just give up
if pte_offset_map() or pte_offset_map_lock() fails. And there's no need
for a preliminary pmd_trans_unstable() or other such check, since such
cases are now safely handled inside.

Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 26e1a0c3 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm: use pmdp_get_lockless() without surplus barrier()

Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

What is it all about? Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs; but
likely to be relied upon later in other contexts e.g. freeing of empty
page tables (but that's not work I'm doing). mmap_write_lock avoidance
when collapsing to anon THPs? Perhaps, but again that's not work I've
done: a quick attempt was not as easy as the shmem/file case.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and have
heard that shmem MADV_COLLAPSE's usefulness is being limited by that
mmap_write_lock it currently requires.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

What is this preparatory series about?

The current mmap locking will not be enough to guard against that tricky
transition between pmd entry pointing to page table, and empty pmd entry,
and pmd entry pointing to huge page: pte_offset_map() will have to
validate the pmd entry for itself, returning NULL if no page table is
there. What to do about that varies: sometimes nearby error handling
indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
again" is appropriate (and if that risks an infinite loop, then there must
have been an oops, or pfn 0 mistaken for page table, before).

Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier, and
sets a better example, if each site is given appropriate handling: even
where deeper study might prove that failure could only happen if the pmd
table were corrupted.

Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more. Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
its place.


This patch (of 32):

Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.

HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.

Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a6e79df9 04-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as we
have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous or belonging to a whitelisted filesystem - specifically
hugetlb and shmem mappings.

We take special care to ensure that both the folio and mapping are safe to
access when performing these checks and document folio_fast_pin_allowed()
accordingly.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.

Link: https://lkml.kernel.org/r/a27d39b87ded7f3dad5fd4181edb106393660453.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8ac26843 04-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings

Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).

This results in both data being written to a folio without writenotify,
and the folio being dirtied unexpectedly (if the caller decides to do so).

This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.

This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not,
such as shmem or hugetlb, are not at risk and therefore can be written to
without issue.

Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.

However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.

These mappings are considerably more likely to be written to after folios
are cleaned and thus simply must not be permitted to do so.

This patch changes only the slow-path GUP functions, a following patch
adapts the GUP-fast path along similar lines.

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/

Link: https://lkml.kernel.org/r/7282506742d2390c125949c2f9894722750bb68a.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b2cac248 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove vmas array from internal GUP functions

Now we have eliminated all callers to GUP APIs which use the vmas
parameter, eliminate it altogether.

This eliminates a class of bugs where vmas might have been kept around
longer than the mmap_lock and thus we need not be concerned about locks
being dropped during this operation leaving behind dangling pointers.

This simplifies the GUP API and makes it considerably clearer as to its
purpose - follow flags are applied and if pinning, an array of pages is
returned.

Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4c630f30 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove vmas parameter from pin_user_pages()

We are now in a position where no caller of pin_user_pages() requires the
vmas parameter at all, so eliminate this parameter from the function and
all callers.

This clears the way to removing the vmas parameter from GUP altogether.

Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> [qib]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> [drivers/media]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ca5e8632 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove vmas parameter from get_user_pages_remote()

The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-

- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
remove it.

- __access_remote_vm() was already using vma_lookup() when the original
lookup failed so by doing the lookup directly this also de-duplicates the
code.

We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().

As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.

This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.

[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0b295316 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove unused vmas parameter from pin_user_pages_remote()

No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it. This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.

Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 54d02069 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove unused vmas parameter from get_user_pages()

Patch series "remove the vmas parameter from GUP APIs", v6.

(pin_/get)_user_pages[_remote]() each provide an optional output parameter
for an array of VMA objects associated with each page in the input range.

These provide the means for VMAs to be returned, as long as mm->mmap_lock
is never released during the GUP operation (i.e. the internal flag
FOLL_UNLOCKABLE is not specified).

In addition, these VMAs can only be accessed with the mmap_lock held and
become invalidated the moment it is released.

The vast majority of invocations do not use this functionality and of
those that do, all but one case retrieve a single VMA to perform checks
upon.

It is not egregious in the single VMA cases to simply replace the
operation with a vma_lookup(). In these cases we duplicate the (fast)
lookup on a slow path already under the mmap_lock, abstracted to a new
get_user_page_vma_remote() inline helper function which also performs
error checking and reference count maintenance.

The special case is io_uring, where io_pin_pages() specifically needs to
assert that the VMAs underlying the range do not result in broken
long-term GUP file-backed mappings.

As GUP now internally asserts that FOLL_LONGTERM mappings are not
file-backed in a broken fashion (i.e. requiring dirty tracking) - as
implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to
file-backed mappings" - this logic is no longer required and so we can
simply remove it altogether from io_uring.

Eliminating the vmas parameter eliminates an entire class of danging
pointer errors that might have occured should the lock have been
incorrectly released.

In addition, the API is simplified and now clearly expresses what it is
intended for - applying the specified GUP flags and (if pinning) returning
pinned pages.

This change additionally opens the door to further potential improvements
in GUP and the possible marrying of disparate code paths.

I have run this series against gup_test with no issues.

Thanks to Matthew Wilcox for suggesting this refactoring!


This patch (of 6):

No invocation of get_user_pages() use the vmas parameter, so remove it.

The GUP API is confusing and caveated. Recent changes have done much to
improve that, however there is more we can do. Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.

Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.

This is part of a patch series aiming to remove the vmas parameter
altogether.

Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts)
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sean Christopherson <seanjc@google.com> (KVM)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 31115034 06-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: add missing gup_must_unshare() check to gup_huge_pgd()

All other instances of gup_huge_pXd() perform the unshare check, so update
the PGD-specific function to do so as well.

While checking pgd_write() might seem unusual, this function already
performs such a check via pgd_access_permitted() so this is in line with
the existing implementation.

David said:

: This change makes unshare handling across all GUP-fast variants
: consistent, which is desirable as GUP-fast is complicated enough
: already even when consistent.
:
: This function was the only one I seemed to have missed (or left out and
: forgot why -- maybe because it's really dead code for now). The COW
: selftest would identify the problem, so far there was no report.
: Either the selftest wasn't run on corresponding architectures with that
: hugetlb size, or that code is still dead code and unused by
: architectures.
:
: the original commit(s) that added unsharing explain why we care about
: these checks:
:
: a7f226604170acd6 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
: 84209e87c6963f92 ("mm/gup: reliable R/O long-term pinning in COW mappings")

Link: https://lkml.kernel.org/r/cb971ac8dd315df97058ea69442ecc007b9a364a.1683381545.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1101fb8f 26-May-2023 David Howells <dhowells@redhat.com>

mm: Provide a function to get an additional pin on a page

Provide a function to get an additional pin on a page that we already have
a pin on. This will be used in fs/direct-io.c when dispatching multiple
bios to a page we've extracted from a user-backed iter rather than redoing
the extraction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20230526214142.958751-3-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c8070b78 26-May-2023 David Howells <dhowells@redhat.com>

mm: Don't pin ZERO_PAGE in pin_user_pages()

Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20230526214142.958751-2-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6014bc27 28-Apr-2023 Linus Torvalds <torvalds@linux-foundation.org>

x86-64: make access_ok() independent of LAM

The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range. See commit 74c228d20a51 ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").

We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.

And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear. No need to bother with any
address bit masking.

In fact, we can go a bit further, and just check the starting address
for known small accesses ranges: any accesses that overflow will still
be in the non-canonical area and will still GP fault.

To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range. But we'll limit that to purely "high bit set
and past the one-page 'slop' area".

We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address. But let's leave that kind of
optimization for later. As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.

The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually. As a result, the page table limits need to be checked,
and that was previously implicitly done by the access_ok().

With the relaxed access_ok() check, we need to just do an explicit check
for TASK_SIZE_MAX in the GUP code instead. The GUP code already needs
to do the tag bit unmasking anyway, so there this is all very
straightforward, and there are no LAM issues.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5da1a868 09-Mar-2023 Jingyu Wang <jingyuwang_vip@163.com>

mm/gup.c: fix typo in comments

Link: https://lkml.kernel.org/r/20230309104813.170309-1-jingyuwang_vip@163.com
Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 428e106a 12-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: Introduce untagged_addr_remote()

untagged_addr() removes tags/metadata from the address and brings it to
the canonical form. The helper is implemented on arm64 and sparc. Both of
them do untagging based on global rules.

However, Linear Address Masking (LAM) on x86 introduces per-process
settings for untagging. As a result, untagged_addr() is now only
suitable for untagging addresses for the current proccess.

The new helper untagged_addr_remote() has to be used when the address
targets remote process. It requires the mmap lock for target mm to be
taken.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com


# be2d5756 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: change to return bool for folio_isolate_lru()

Patch series "Change the return value for page isolation functions", v3.

Now the page isolation functions did not return a boolean to indicate
success or not, instead it will return a negative error when failed
to isolate a page. So below code used in most places seem a boolean
success/failure thing, which can confuse people whether the isolation
is successful.

if (folio_isolate_lru(folio))
continue;

Moreover the page isolation functions only return 0 or -EBUSY, and
most users did not care about the negative error except for few users,
thus we can convert all page isolation functions to return a boolean
value, which can remove the confusion to make code more clear.

No functional changes intended in this patch series.


This patch (of 4):

Now the folio_isolate_lru() did not return a boolean value to indicate
isolation success or not, however below code checking the return value can
make people think that it was a boolean success/failure thing, which makes
people easy to make mistakes (see the fix patch[1]).

if (folio_isolate_lru(folio))
continue;

Thus it's better to check the negative error value expilictly returned by
folio_isolate_lru(), which makes code more clear per Linus's
suggestion[2]. Moreover Matthew suggested we can convert the isolation
functions to return a boolean[3], since most users did not care about the
negative error value, and can also remove the confusing of checking return
value.

So this patch converts the folio_isolate_lru() to return a boolean value,
which means return 'true' to indicate the folio isolation is successful,
and 'false' means a failure to isolation. Meanwhile changing all users'
logic of checking the isolation state.

No functional changes intended.

[1] https://lore.kernel.org/all/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com/T/#u
[2] https://lore.kernel.org/all/CAHk-=wiBrY+O-4=2mrbVyxR+hOqfdJ=Do6xoucfJ9_5az01L4Q@mail.gmail.com/
[3] https://lore.kernel.org/all/Y+sTFqwMNAjDvxw3@casper.infradead.org/

Link: https://lkml.kernel.org/r/cover.1676424378.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/8a4e3679ed4196168efadf7ea36c038f2f7d5aa9.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6aa3a920 13-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert isolate_hugetlb to folios

Patch series "continue hugetlb folio conversion", v3.

This series continues the conversion of core hugetlb functions to use
folios. This series converts many helper funtions in the hugetlb fault
path. This is in preparation for another series to convert the hugetlb
fault code paths to operate on folios.


This patch (of 8):

Convert isolate_hugetlb() to take in a folio and convert its callers to
pass a folio. Use page_folio() to convert the callers to use a folio is
safe as isolate_hugetlb() operates on a head page.

Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9198a919 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: make get_user_pages_fast_only() return the common return value

There are only two callers, both can handle the common return code:

- get_user_page_fast_only() checks == 1

- gfn_to_page_many_atomic() already returns -1, and the only caller
checks for negative return values

Remove the restriction against returning negative values.

Link: https://lkml.kernel.org/r/11-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# edad1bb1 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: remove pin_user_pages_fast_only()

Commit ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry
about obj->mm.lock, v7.") removed the only caller, remove this dead code
too.

Link: https://lkml.kernel.org/r/10-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9a863a6a 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: make locked never NULL in the internal GUP functions

Now that NULL locked doesn't have a special meaning we can just make it
non-NULL in all cases and remove the special tests.

get_user_pages() and pin_user_pages() can safely pass in a locked = 1

get_user_pages_remote) and pin_user_pages_remote() can swap in a local
variable for locked if NULL is passed.

Remove all the NULL checks.

Link: https://lkml.kernel.org/r/9-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f04740f5 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: add FOLL_UNLOCKABLE

Setting FOLL_UNLOCKABLE allows GUP to lock/unlock the mmap lock on its
own. It is a more explicit replacement for locked != NULL. This clears
the way for passing in locked = 1, without intending that the lock can be
unlocked.

Set the flag in all cases where it is used, eg locked is present in the
external interface or locked is used internally with locked = 0.

Link: https://lkml.kernel.org/r/8-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6e4382c7 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: remove locked being NULL from faultin_vma_page_range()

The only caller of this function always passes in a non-NULL locked, so
just remove this obsolete comment.

Link: https://lkml.kernel.org/r/7-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 961ba472 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: add an assertion that the mmap lock is locked

Since commit 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked()
annotations to find_vma*()") we already have this assertion, it is just
buried in find_vma():

__get_user_pages_locked()
__get_user_pages()
find_extend_vma()
find_vma()

Also check it at the top of __get_user_pages_locked() as a form of
documentation.

Link: https://lkml.kernel.org/r/6-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d64e2dbc 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: simplify the external interface functions and consolidate invariants

The GUP family of functions have a complex, but fairly well defined, set
of invariants for their arguments. Currently these are sprinkled about,
sometimes in duplicate through many functions.

Internally we don't follow all the invariants that the external interface
has to follow, so place these checks directly at the exported interface.
This ensures the internal functions never reach a violated invariant.

Remove the duplicated invariant checks.

The end result is to make these functions fully internal:
__get_user_pages_locked()
internal_get_user_pages_fast()
__gup_longterm_locked()

And all the other functions call directly into one of these.

Link: https://lkml.kernel.org/r/5-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# afa3c33e 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: don't call __gup_longterm_locked() if FOLL_LONGTERM cannot be set

get_user_pages_remote(), get_user_pages_unlocked() and get_user_pages()
are never called with FOLL_LONGTERM, so directly call
__get_user_pages_locked()

The next patch will add an assertion for this.

Link: https://lkml.kernel.org/r/3-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b2a72dff 24-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: have internal functions get the mmap_read_lock()

Patch series "Simplify the external interface for GUP", v2.

It is quite a maze of EXPORTED symbols leading up to the three actual
worker functions of GUP. Simplify this by reorganizing some of the code so
the EXPORTED symbols directly call the correct internal function with
validated and consistent arguments.

Consolidate all the assertions into one place at the top of the call
chains.

Remove some dead code.

Move more things into the mm/internal.h header


This patch (of 13):

__get_user_pages_locked() and __gup_longterm_locked() both require the
mmap lock to be held. They have a slightly unusual locked parameter that
is used to allow these functions to unlock and relock the mmap lock and
convey that fact to the caller.

Several places wrap these functions with a simple mmap_read_lock() just so
they can follow the optimized locked protocol.

Consolidate this internally to the functions. Allow internal callers to
set locked = 0 to cause the functions to acquire and release the lock on
their own.

Reorganize __gup_longterm_locked() to use the autolocking in
__get_user_pages_locked().

Replace all the places obtaining the mmap_read_lock() just to call
__get_user_pages_locked() with the new mechanism. Replace all the
internal callers of get_user_pages_unlocked() with direct calls to
__gup_longterm_locked() using the new mechanism.

A following patch will add assertions ensuring the external interface
continues to always pass in locked = 1.

Link: https://lkml.kernel.org/r/0-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Link: https://lkml.kernel.org/r/1-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c5acf1f6 25-Jan-2023 Jongwoo Han <jongwooo.han@gmail.com>

mm/gup.c: fix typo in comments

Link: https://lkml.kernel.org/r/20230125180847.4542-1-jongwooo.han@gmail.com
Signed-off-by: Jongwoo Han <jongwooo.han@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 94688e8e 11-Jan-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove folio_pincount_ptr() and head_compound_pincount()

We can use folio->_pincount directly, since all users are guarded by tests
of compound/large.

Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# aa1e6a93 30-Jan-2023 Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>

mm/gup: add folio to list when folio_isolate_lru() succeed

If we call folio_isolate_lru() successfully, we will get return value 0.
We need to add this folio to the movable_pages_list.

Link: https://lkml.kernel.org/r/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com
Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Andrew Yang <andrew.yang@mediatek.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1180e732 26-Nov-2020 Peter Zijlstra <peterz@infradead.org>

mm/gup: Fix the lockless PMD access

On architectures where the PTE/PMD is larger than the native word size
(i386-PAE for example), READ_ONCE() can do the wrong thing. Use
pmdp_get_lockless() just like we use ptep_get_lockless().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114424.906110403%40infradead.org


# 93c5c61d 11-Oct-2022 Peter Xu <peterx@redhat.com>

mm/gup: Add FOLL_INTERRUPTIBLE

We have had FAULT_FLAG_INTERRUPTIBLE but it was never applied to GUPs. One
issue with it is that not all GUP paths are able to handle signal delivers
besides SIGKILL.

That's not ideal for the GUP users who are actually able to handle these
cases, like KVM.

KVM uses GUP extensively on faulting guest pages, during which we've got
existing infrastructures to retry a page fault at a later time. Allowing
the GUP to be interrupted by generic signals can make KVM related threads
to be more responsive. For examples:

(1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
generated to kick the vcpus out of kernel context immediately,

(2) SIGINT: which can be used with interactive hypervisor users to stop a
virtual machine with Ctrl-C without any delays/hangs,

(3) SIGTRAP: which grants GDB capability even during page faults that are
stuck for a long time.

Normally hypervisor will be able to receive these signals properly, but not
if we're stuck in a GUP for a long time for whatever reason. It happens
easily with a stucked postcopy migration when e.g. a network temp failure
happens, then some vcpu threads can hang death waiting for the pages. With
the new FOLL_INTERRUPTIBLE, we can allow GUP users like KVM to selectively
enable the ability to trap these signals.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20221011195809.557016-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f7355e99 20-Oct-2022 David Hildenbrand <david@redhat.com>

mm/gup: remove FOLL_MIGRATION

Fortunately, the last user (KSM) is gone, so let's just remove this rather
special code from generic GUP handling -- especially because KSM never
required the PMD handling as KSM only deals with individual base pages.

[akpm@linux-foundation.org: fix merge snafu]Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f347454d 31-Oct-2022 David Hildenbrand <david@redhat.com>

mm/gup: disallow FOLL_FORCE|FOLL_WRITE on hugetlb mappings

hugetlb does not support fake write-faults (write faults without write
permissions). However, we are currently able to trigger a
FAULT_FLAG_WRITE fault on a VMA without VM_WRITE.

If we'd ever want to support FOLL_FORCE|FOLL_WRITE, we'd have to teach
hugetlb to:

(1) Leave the page mapped R/O after the fake write-fault, like
maybe_mkwrite() does.
(2) Allow writing to an exclusive anon page that's mapped R/O when
FOLL_FORCE is set, like can_follow_write_pte(). E.g.,
__follow_hugetlb_must_fault() needs adjustment.

For now, it's not clear if that added complexity is really required.
History tolds us that FOLL_FORCE is dangerous and that we better limit its
use to a bare minimum.

--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <linux/mman.h>

int main(int argc, char **argv)
{
char *map;
int mem_fd;

map = mmap(NULL, 2 * 1024 * 1024u, PROT_READ,
MAP_PRIVATE|MAP_ANON|MAP_HUGETLB|MAP_HUGE_2MB, -1, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %d\n", errno);
return 1;
}

mem_fd = open("/proc/self/mem", O_RDWR);
if (mem_fd < 0) {
fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno);
return 1;
}

if (pwrite(mem_fd, "0", 1, (uintptr_t) map) == 1) {
fprintf(stderr, "write() succeeded, which is unexpected\n");
return 1;
}

printf("write() failed as expected: %d\n", errno);
return 0;
}
--------------------------------------------------------------------------

Fortunately, we have a sanity check in hugetlb_wp() in place ever since
commit 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared
mappings"), that bails out instead of silently mapping a page writable in
a !PROT_WRITE VMA.

Consequently, above reproducer triggers a warning, similar to the one
reported by szsbot:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 3612 at mm/hugetlb.c:5313 hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313
Modules linked in:
CPU: 1 PID: 3612 Comm: syz-executor250 Not tainted 6.1.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313
Code: ea 03 80 3c 02 00 0f 85 31 14 00 00 49 8b 5f 20 31 ff 48 89 dd 83 e5 02 48 89 ee e8 70 ab b7 ff 48 85 ed 75 5b e8 76 ae b7 ff <0f> 0b 41 bd 40 00 00 00 e8 69 ae b7 ff 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc90003caf620 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000008640070 RCX: 0000000000000000
RDX: ffff88807b963a80 RSI: ffffffff81c4ed2a RDI: 0000000000000007
RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000008c07e R12: ffff888023805800
R13: 0000000000000000 R14: ffffffff91217f38 R15: ffff88801d4b0360
FS: 0000555555bba300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff7a47a1b8 CR3: 000000002378d000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
hugetlb_no_page mm/hugetlb.c:5755 [inline]
hugetlb_fault+0x19cc/0x2060 mm/hugetlb.c:5874
follow_hugetlb_page+0x3f3/0x1850 mm/hugetlb.c:6301
__get_user_pages+0x2cb/0xf10 mm/gup.c:1202
__get_user_pages_locked mm/gup.c:1434 [inline]
__get_user_pages_remote+0x18f/0x830 mm/gup.c:2187
get_user_pages_remote+0x84/0xc0 mm/gup.c:2260
__access_remote_vm+0x287/0x6b0 mm/memory.c:5517
ptrace_access_vm+0x181/0x1d0 kernel/ptrace.c:61
generic_ptrace_pokedata kernel/ptrace.c:1323 [inline]
ptrace_request+0xb46/0x10c0 kernel/ptrace.c:1046
arch_ptrace+0x36/0x510 arch/x86/kernel/ptrace.c:828
__do_sys_ptrace kernel/ptrace.c:1296 [inline]
__se_sys_ptrace kernel/ptrace.c:1269 [inline]
__x64_sys_ptrace+0x178/0x2a0 kernel/ptrace.c:1269
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]

So let's silence that warning by teaching GUP code that FOLL_FORCE -- so
far -- does not apply to hugetlb.

Note that FOLL_FORCE for read-access seems to be working as expected. The
assumption is that this has been broken forever, only ever since above
commit, we actually detect the wrong handling and WARN_ON_ONCE().

I assume this has been broken at least since 2014, when mm/gup.c came to
life. I failed to come up with a suitable Fixes tag quickly.

Link: https://lkml.kernel.org/r/20221031152524.173644-1-david@redhat.com
Fixes: 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: <syzbot+f0b97304ef90f0d0b1dc@syzkaller.appspotmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 84209e87 16-Nov-2022 David Hildenbrand <david@redhat.com>

mm/gup: reliable R/O long-term pinning in COW mappings

We already support reliable R/O pinning of anonymous memory. However,
assume we end up pinning (R/O long-term) a pagecache page or the shared
zeropage inside a writable private ("COW") mapping. The next write access
will trigger a write-fault and replace the pinned page by an exclusive
anonymous page in the process page tables to break COW: the pinned page no
longer corresponds to the page mapped into the process' page table.

Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
COW mapping, let's properly break COW first before R/O long-term
pinning something that's not an exclusive anon page inside a COW
mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page
instead that can get pinned safely.

With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable
R/O long-term pinning in COW mappings.

With this change, the new R/O long-term pinning tests for non-anonymous
memory succeed:
# [RUN] R/O longterm GUP pin ... with shared zeropage
ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
ok 162 Longterm R/O pin is reliable

Note 1: We don't care about short-term R/O-pinning, because they have
snapshot semantics: they are not supposed to observe modifications that
happen after pinning.

As one example, assume we start direct I/O to read from a page and store
page content into a file: modifications to page content after starting
direct I/O are not guaranteed to end up in the file. So even if we'd pin
the shared zeropage, the end result would be as expected -- getting zeroes
stored to the file.

Note 2: For shared mappings we'll now always fallback to the slow path to
lookup the VMA when R/O long-term pining. While that's the necessary price
we have to pay right now, it's actually not that bad in practice: most
FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
FOLL_FORCE because they tried dealing with COW mappings correctly ...

Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
populate exclusive anon pages that we can pin. There was a concern that
this could affect the memlock limit of existing setups.

For example, a VM running with VFIO could run into the memlock limit and
fail to run. However, we essentially had the same behavior already in
commit 17839856fd58 ("gup: document and work around "COW can break either
way" issue") which got merged into some enterprise distros, and there were
not any such complaints. So most probably, we're fine.

Link: https://lkml.kernel.org/r/20221116102659.70287-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 53b2d09b 16-Nov-2022 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: remove the restriction on locked with FOLL_LONGTERM

This restriction was created because FOLL_LONGTERM used to scan the vma
list, so it could not tolerate becoming unlocked. That was fixed in
commit 52650c8b466b ("mm/gup: remove the vma allocation from
gup_longterm_locked()") and the restriction on !vma was removed.

However, the locked restriction remained, even though it isn't necessary
anymore.

Adjust __gup_longterm_locked() so it can handle the mmap_read_lock()
becoming unlocked while it is looping for migration. Migration does not
require the mmap_read_sem because it is only handling struct pages. If we
had to unlock then ensure the whole thing returns unlocked.

Remove __get_user_pages_remote() and __gup_longterm_unlocked(). These
cases can now just directly call other functions.

Link: https://lkml.kernel.org/r/0-v1-b9ae39aa8884+14dbb-gup_longterm_locked_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4003f107 21-Oct-2022 Logan Gunthorpe <logang@deltatee.com>

mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error in try_grab_page() or
try_grab_folio().

The check is safe to do before taking the reference to the page in both
cases seeing the page should be protected by either the appropriate
ptl or mmap_lock; or the gup fast guarantees preventing TLB flushes.

try_grab_folio() has one call site that WARNs on failure and cannot
actually deal with the failure of this function (it seems it will
get into an infinite loop). Expand the comment there to document a
couple more conditions on why it will not fail.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set. This is to copy
fsdax until pgmap refcounts are fixed (see the link below for more
information).

Link: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221021174116.7200-3-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f089235 21-Oct-2022 Logan Gunthorpe <logang@deltatee.com>

mm: allow multiple error returns in try_grab_page()

In order to add checks for P2PDMA memory into try_grab_page(), expand
the error return from a bool to an int/error code. Update all the
callsites handle change in usage.

Also remove the WARN_ON_ONCE() call at the callsites seeing there
already is a WARN_ON_ONCE() inside the function if it fails.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221021174116.7200-2-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 57a196a5 18-Sep-2022 Mike Kravetz <mike.kravetz@oracle.com>

hugetlb: simplify hugetlb handling in follow_page_mask

During discussions of this series [1], it was suggested that hugetlb
handling code in follow_page_mask could be simplified. At the beginning
of follow_page_mask, there currently is a call to follow_huge_addr which
'may' handle hugetlb pages. ia64 is the only architecture which provides
a follow_huge_addr routine that does not return error. Instead, at each
level of the page table a check is made for a hugetlb entry. If a hugetlb
entry is found, a call to a routine associated with that entry is made.

Currently, there are two checks for hugetlb entries at each page table
level. The first check is of the form:

if (p?d_huge())
page = follow_huge_p?d();

the second check is of the form:

if (is_hugepd())
page = follow_huge_pd().

We can replace these checks, as well as the special handling routines such
as follow_huge_p?d() and follow_huge_pd() with a single routine to handle
hugetlb vmas.

A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the
beginning of follow_page_mask. hugetlb_follow_page_mask will use the
existing routine huge_pte_offset to walk page tables looking for hugetlb
entries. huge_pte_offset can be overwritten by architectures, and already
handles special cases such as hugepd entries.

[1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.wang@linux.alibaba.com/

[mike.kravetz@oracle.com: remove vma (pmd sharing) per Peter]
Link: https://lkml.kernel.org/r/20221028181108.119432-1-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: remove left over hugetlb_vma_unlock_read()]
Link: https://lkml.kernel.org/r/20221030225825.40872-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220919021348.22151-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fcd0ccd8 06-Dec-2022 John Starks <jostarks@microsoft.com>

mm/gup: fix gup_pud_range() for dax

For dax pud, pud_huge() returns true on x86. So the function works as long
as hugetlb is configured. However, dax doesn't depend on hugetlb.
Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed
devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as
well.

This fixes the below kernel panic:

general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP
< snip >
Call Trace:
<TASK>
get_user_pages_fast+0x1f/0x40
iov_iter_get_pages+0xc6/0x3b0
? mempool_alloc+0x5d/0x170
bio_iov_iter_get_pages+0x82/0x4e0
? bvec_alloc+0x91/0xc0
? bio_alloc_bioset+0x19a/0x2a0
blkdev_direct_IO+0x282/0x480
? __io_complete_rw_common+0xc0/0xc0
? filemap_range_has_page+0x82/0xc0
generic_file_direct_write+0x9d/0x1a0
? inode_update_time+0x24/0x30
__generic_file_write_iter+0xbd/0x1e0
blkdev_write_iter+0xb4/0x150
? io_import_iovec+0x8d/0x340
io_write+0xf9/0x300
io_issue_sqe+0x3c3/0x1d30
? sysvec_reschedule_ipi+0x6c/0x80
__io_queue_sqe+0x33/0x240
? fget+0x76/0xa0
io_submit_sqes+0xe6a/0x18d0
? __fget_light+0xd1/0x100
__x64_sys_io_uring_enter+0x199/0x880
? __context_tracking_enter+0x1f/0x70
? irqentry_exit_to_user_mode+0x24/0x30
? irqentry_exit+0x1d/0x30
? __context_tracking_exit+0xe/0x70
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7fc97c11a7be
< snip >
</TASK>
---[ end trace 48b2e0e67debcaeb ]---
RIP: 0010:internal_get_user_pages_fast+0x340/0x990
< snip >
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled

Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com
Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax")
Signed-off-by: John Starks <jostarks@microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fac35ba7 01-Sep-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page

On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified.

So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
hugetlb in follow_page_pte(). However this pte entry lock is incorrect
for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
the correct lock, which is mm->page_table_lock.

That means the pte entry of the CONT-PTE size hugetlb under current pte
lock is unstable in follow_page_pte(), we can continue to migrate or
poison the pte entry of the CONT-PTE size hugetlb, which can cause some
potential race issues, even though they are under the 'pte lock'.

For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
page by move_pages() syscall under the lock, however antoher thread B can
migrate the CONT-PTE hugetlb page at the same time, which will cause
thread A to get an incorrect page, if thread A also wants to do page
migration, then data inconsistency error occurs.

Moreover we have the same issue for CONT-PMD size hugetlb in
follow_huge_pmd().

To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
get the correct pte entry lock to make the pte entry stable.

Mike said:

Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb:
refactor find_num_contig()"). Patch series "Support for contiguous pte
hugepages", v4. However, I do not believe these code paths were
executed until migration support was added with 5480280d3f2d ("arm64/mm:
enable HugeTLB migration for contiguous bit HugeTLB pages") I would go
with 5480280d3f2d for the Fixes: targe.

Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0cf45986 25-Aug-2022 David Hildenbrand <david@redhat.com>

mm/gup: use gup_can_follow_protnone() also in GUP-fast

There seems to be no reason why FOLL_FORCE during GUP-fast would have to
fallback to the slow path when stumbling over a PROT_NONE mapped page. We
only have to trigger hinting faults in case FOLL_FORCE is not set, and any
kind of fault handling naturally happens from the slow path -- where NUMA
hinting accounting/handling would be performed.

Note that the comment regarding THP migration is outdated: commit
2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP
migration") described that this was required for THP due to lack of PMD
migration entries. Nowadays, we do have proper PMD migration entries in
place -- see set_pmd_migration_entry(), which does a proper
pmdp_invalidate() when placing the migration entry.

So let's just reuse gup_can_follow_protnone() here to make it consistent
and drop the somewhat outdated comments.

Link: https://lkml.kernel.org/r/20220825164659.89824-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 474098ed 25-Aug-2022 David Hildenbrand <david@redhat.com>

mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()

Patch series "mm: minor cleanups around NUMA hinting".

Working on some GUP cleanups (e.g., getting rid of some FOLL_ flags) and
preparing for other GUP changes (getting rid of FOLL_FORCE|FOLL_WRITE for
for taking a R/O longterm pin), this is something I can easily send out
independently.

Get rid of FOLL_NUMA, allow FOLL_FORCE access to PROT_NONE mapped pages in
GUP-fast, and fixup some documentation around NUMA hinting.


This patch (of 3):

No need for a special flag that is not even properly documented to be
internal-only.

Let's just factor this check out and get rid of this flag. The separate
function has the nice benefit that we can centralize comments.

Link: https://lkml.kernel.org/r/20220825164659.89824-2-david@redhat.com
Link: https://lkml.kernel.org/r/20220825164659.89824-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c4d1a92d 06-Sep-2022 Liam R. Howlett <Liam.Howlett@Oracle.com>

mm/gup: use maple tree navigation instead of linked list

Use find_vma_intersection() to locate the VMAs in __mm_populate() instead
of using find_vma() and the linked list.

Link: https://lkml.kernel.org/r/20220906194824.2110408-52-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 088b8aa5 01-Sep-2022 David Hildenbrand <david@redhat.com>

mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast

commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with
PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
cleared during temporary unmapping of a page, that the PTE is
cleared/invalidated and that the TLB is flushed.

What we want to achieve in all cases is that we cannot end up with a pin on
an anonymous page that may be shared, because such pins would be
unreliable and could result in memory corruptions when the mapped page
and the pin go out of sync due to a write fault.

That TLB flush handling was inspired by an outdated comment in
mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
the past to synchronize with GUP-fast. However, ever since general RCU GUP
fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
concurrent GUP-fast in all cases -- it only handles traditional IPI-based
GUP-fast correctly.

Peter Xu (thankfully) questioned whether that TLB flush is really
required. On architectures that send an IPI broadcast on TLB flush,
it works as expected. To synchronize with RCU GUP-fast properly, we're
conceptually fine, however, we have to enforce a certain memory order and
are missing memory barriers.

Let's document that, avoid the TLB flush where possible and use proper
explicit memory barriers where required. We shouldn't really care about the
additional memory barriers here, as we're not on extremely hot paths --
and we're getting rid of some TLB flushes.

We use a smp_mb() pair for handling concurrent pinning and a
smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
PTE changes but permanent PageAnonExclusive changes.

One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
convert an exclusive anonymous page to a KSM page, and that page is already
mapped write-protected (-> no PTE change) would be:

Thread 0 (KSM) Thread 1 (GUP-fast)

(B1) Read the PTE
# (B2) skipped without FOLL_WRITE
(A1) Clear PTE
smp_mb()
(A2) Check pinned
(B3) Pin the mapped page
smp_mb()
(A3) Clear PageAnonExclusive
smp_wmb()
(A4) Restore PTE
(B4) Check if the PTE changed
smp_rmb()
(B5) Check PageAnonExclusive

Thread 1 will properly detect that PageAnonExclusive was cleared and
back off.

Note that we don't need a memory barrier between checking if the page is
pinned and clearing PageAnonExclusive, because stores are not
speculated.

The possible issues due to reordering are of theoretical nature so far
and attempts to reproduce the race failed.

Especially the "no PTE change" case isn't the common case, because we'd
need an exclusive anonymous page that's mapped R/O and the PTE is clean
in KSM code -- and using KSM with page pinning isn't extremely common.
Further, the clear+TLB flush we used for now implies a memory barrier.
So the problematic missing part should be the missing memory barrier
after pinning but before checking if the PTE changed.

Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 67e139b0 23-Aug-2022 Alistair Popple <apopple@nvidia.com>

mm/gup.c: refactor check_and_migrate_movable_pages()

When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.

When migration succeeds all pages will have been unpinned so pinning needs
to be retried. Migration can also fail, in which case the pages will also
have been unpinned but the operation should not be retried. If all pages
are in the correct zone nothing will be unpinned and no retry is required.

The logic in check_and_migrate_movable_pages() tracks unnecessary state
and the return codes for each case are difficult to follow. Refactor the
code to clean this up. No behaviour change is intended.

[akpm@linux-foundation.org: fix unused var warning]
Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f6d299ec 23-Aug-2022 Alistair Popple <apopple@nvidia.com>

mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()

gup_flags is passed to check_and_migrate_movable_pages() so that it can
call either put_page() or unpin_user_page() to drop the page reference.
However check_and_migrate_movable_pages() is only called for
FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass
gup_flags.

Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 24a95998 28-Jul-2022 Alistair Popple <apopple@nvidia.com>

mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes

When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.

When migration succeeds all pages will have been unpinned so pinning needs
to be retried. This is indicated by returning zero. When all pages are
in the correct zone the number of pinned pages is returned.

However migration can also fail, in which case pages are unpinned and
-ENOMEM is returned. However if the failure was due to not being unable
to isolate a page zero is returned. This leads to indefinite looping in
__gup_longterm_locked().

Fix this by simplifying the return codes such that zero indicates all
pages were successfully pinned in the correct zone while errors indicate
either pages were migrated and pinning should be retried or that migration
has failed and therefore the pinning operation should fail.

[syoshida@redhat.com: fix return value for __gup_longterm_locked()]
Link: https://lkml.kernel.org/r/20220821183547.950370-1-syoshida@redhat.com
[akpm@linux-foundation.org: fix code layout, per John]
[yshigeru@gmail.com: fix uninitialized return value on __gup_longterm_locked()]
Link: https://lkml.kernel.org/r/20220827230037.78876-1-syoshida@redhat.com
Link: https://lkml.kernel.org/r/20220729024645.764366-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 70cbc3cc 07-Sep-2022 Yang Shi <shy828301@gmail.com>

mm: gup: fix the fast GUP race against THP collapse

Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
sufficient to handle concurrent GUP-fast in all cases, it only handles
traditional IPI-based GUP-fast correctly. On architectures that send an
IPI broadcast on TLB flush, it works as expected. But on the
architectures that do not use IPI to broadcast TLB flush, it may have the
below race:

CPU A CPU B
THP collapse fast GUP
gup_pmd_range() <-- see valid pmd
gup_pte_range() <-- work on pte
pmdp_collapse_flush() <-- clear pmd and flush
__collapse_huge_page_isolate()
check page pinned <-- before GUP bump refcount
pin the page
check PTE <-- no change
__collapse_huge_page_copy()
copy data to huge page
ptep_clear()
install huge pmd for the huge page
return the stale page
discard the stale page

The race can be fixed by checking whether PMD is changed or not after
taking the page pin in fast GUP, just like what it does for PTE. If the
PMD is changed it means there may be parallel THP collapse, so GUP should
back off.

Also update the stale comment about serializing against fast GUP in
khugepaged.

Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5535be30 09-Aug-2022 David Hildenbrand <david@redhat.com>

mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW

Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
that FOLL_FORCE can be possibly dangerous, especially if there are races
that can be exploited by user space.

Right now, it would be sufficient to have some code that sets a PTE of a
R/O-mapped shared page dirty, in order for it to erroneously become
writable by FOLL_FORCE. The implications of setting a write-protected PTE
dirty might not be immediately obvious to everyone.

And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the
user does not have write permissions to the file, and to bypass memfd
write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).

To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.

Let's move away from this mandatory-retry + dirty handling and rely on our
PageAnonExclusive() flag for making a similar decision, to use the same
COW logic as in other kernel parts here as well. In case we stumble over
a PTE in a COW mapping that does not map an exclusive anonymous page, COW
was not properly broken and we have to trigger a fake write-fault to break
COW.

Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e
("mm/mprotect: try avoiding write faults for exclusive anonymous pages
when changing protection") and commit 76aefad628aa ("mm/mprotect: fix
soft-dirty check in can_change_pte_writable()"), take care of softdirty
and uffd-wp manually.

For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.

This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.

Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable.

Note 1: We don't check for the PTE being dirty because it doesn't matter
for making a "was COWed" decision anymore, and whoever modifies the
page has to set the page dirty either way.

Note 2: Kernels before extended uffd-wp support and before
PageAnonExclusive (< 5.19) can simply revert the problematic
commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
v5.19 requires minor adjustments due to lack of
vma_soft_dirty_enabled().

Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.16]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 65974cb9 20-Jul-2022 Alistair Popple <apopple@nvidia.com>

mm/gup.c: fix formatting in check_and_migrate_movable_page()

Commit b05a79d4377f ("mm/gup: migrate device coherent pages when pinning
instead of failing") added a badly formatted if statement. Fix it.

Link: https://lkml.kernel.org/r/20220721020552.1397598-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 396a400b 30-Jun-2022 Linus Walleij <linus.walleij@linaro.org>

mm: gup: pass a pointer to virt_to_page()

Functions that work on a pointer to virtual memory such as virt_to_pfn()
and users of that function such as virt_to_page() are supposed to pass a
pointer to virtual memory, ideally a (void *) or other pointer. However
since many architectures implement virt_to_pfn() as a macro, this function
becomes polymorphic and accepts both a (unsigned long) and a (void *).

If we instead implement a proper virt_to_pfn(void *addr) function the
following happens (occurred on arch/arm):

mm/gup.c: In function '__get_user_pages_locked':
mm/gup.c:1599:49: warning: passing argument 1 of 'virt_to_pfn'
makes pointer from integer without a cast [-Wint-conversion]
pages[i] = virt_to_page(start);

Fix this with an explicit cast.

Link: https://lkml.kernel.org/r/20220630084124.691207-5-linus.walleij@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b05a79d4 15-Jul-2022 Alistair Popple <apopple@nvidia.com>

mm/gup: migrate device coherent pages when pinning instead of failing

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like pinning
ZONE_MOVABLE pages. So instead of failing all attempts to pin them first
try migrating them out of ZONE_DEVICE.

[hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c]
Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6077c943 15-Jul-2022 Alex Sierra <alex.sierra@amd.com>

mm: rename is_pinnable_page() to is_longterm_pinnable_page()

Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory
mapping", v9.

This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.

System stability and performance are not affected according to our ongoing
testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory (aka VRAM)
as SPM (special purpose memory) in the UEFI system address map.

The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.

Our test nodes in the lab are similar to the Frontier configuration, with
.5 TB of system memory plus 256 GB of device memory split across 4 GPUs,
all in a single coherent address space. Page migration is expected to
improve application efficiency significantly. We will report empirical
results as they become available.

Coherent device type pages at gup are now migrated back to system memory
if they are being pinned long-term (FOLL_LONGTERM). The reason is, that
long-term pinning would interfere with the device memory manager owning
the device-coherent pages (e.g. evictions in TTM). These series
incorporate Alistair Popple patches to do this migration from
pin_user_pages() calls. hmm_gup_test has been added to hmm-test to test
different get user pages calls.

This series includes handling of device-managed anonymous pages returned
by vm_normal_pages. Although they behave like normal pages for purposes
of mapping in CPU page tables and for COW, they do not support LRU lists,
NUMA migration or THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they expect
to put pages on an LRU list.


This patch (of 14):

is_pinnable_page() and folio_is_pinnable() are renamed to
is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively.
These functions are used in the FOLL_LONGTERM flag context.

Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com
Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7ce82f4c 30-May-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: return errno when isolate_huge_page failed

We might fail to isolate huge page due to e.g. the page is under
migration which cleared HPageMigratable. We should return errno in this
case rather than always return 1 which could confuse the user, i.e. the
caller might think all of the memory is migrated while the hugetlb page is
left behind. We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.

Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d9272525 30-May-2022 Peter Xu <peterx@redhat.com>

mm: avoid unnecessary page fault retires on shared memory types

I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f4f451a1 05-Jul-2022 Muchun Song <songmuchun@bytedance.com>

mm: fix missing wake-up event for FSDAX pages

FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
1, then the page is freed. The FSDAX pages can be pinned through GUP,
then they will be unpinned via unpin_user_page() using a folio variant
to put the page, however, folio variants did not consider this special
case, the result will be to miss a wakeup event (like the user of
__fuse_dax_break_layouts()). This results in a task being permanently
stuck in TASK_INTERRUPTIBLE state.

Since FSDAX pages are only possibly obtained by GUP users, so fix GUP
instead of folio_put() to lower overhead.

Link: https://lkml.kernel.org/r/20220705123532.283-1-songmuchun@bytedance.com
Fixes: d8ddc099c6b3 ("mm/gup: Add gup_put_folio()")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0768c8de 09-May-2022 Yury Norov <yury.norov@gmail.com>

mm/gup: fix comments to pin_user_pages_*()

pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API
requires struct page **pages to be provided (not NULL). However, the
comment to pin_user_pages() clearly allows for passing in a NULL @pages
argument.

Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
enforce the API.

It has been independently spotted by Minchan Kim and confirmed with
John Hubbard:

https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/

Link: https://lkml.kernel.org/r/20220422015839.1274328-1-yury.norov@gmail.com
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b6a2619c 09-May-2022 David Hildenbrand <david@redhat.com>

mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning

Let's verify when (un)pinning anonymous pages that we always deal with
exclusive anonymous pages, which guarantees that we'll have a reliable
PIN, meaning that we cannot end up with the GUP pin being inconsistent
with he pages mapped into the page tables due to a COW triggered by a
write fault.

When pinning pages, after conditionally triggering GUP unsharing of
possibly shared anonymous pages, we should always only see exclusive
anonymous pages. Note that anonymous pages that are mapped writable must
be marked exclusive, otherwise we'd have a BUG.

When pinning during ordinary GUP, simply add a check after our conditional
GUP-triggered unsharing checks. As we know exactly how the page is
mapped, we know exactly in which page we have to check for
PageAnonExclusive().

When pinning via GUP-fast we have to be careful, because we can race with
fork(): verify only after we made sure via the seqcount that we didn't
race with concurrent fork() that we didn't end up pinning a possibly
shared anonymous page.

Similarly, when unpinning, verify that the pages are still marked as
exclusive: otherwise something turned the pages possibly shared, which can
result in random memory corruptions, which we really want to catch.

With only the pinned pages at hand and not the actual page table entries
we have to be a bit careful: hugetlb pages are always mapped via a single
logical page table entry referencing the head page and PG_anon_exclusive
of the head page applies. Anon THP are a bit more complicated, because we
might have obtained the page reference either via a PMD or a PTE --
depending on the mapping type we either have to check PageAnonExclusive of
the head page (PMD-mapped THP) or the tail page (PTE-mapped THP) applies:
as we don't know and to make our life easier, check that either is set.

Take care to not verify in case we're unpinning during GUP-fast because we
detected concurrent fork(): we might stumble over an anonymous page that
is now shared.

Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a7f22660 09-May-2022 David Hildenbrand <david@redhat.com>

mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page

Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.

The possible ways to deal with this situation are:
(1) Ignore and pin -- what we do right now.
(2) Fail to pin -- which would be rather surprising to callers and
could break user space.
(3) Trigger unsharing and pin the now exclusive page -- reliable R/O
pins.

Let's implement 3) because it provides the clearest semantics and allows
for checking in unpin_user_pages() and friends for possible BUGs: when
trying to unpin a page that's no longer exclusive, clearly something went
very wrong and might result in memory corruptions that might be hard to
debug. So we better have a nice way to spot such issues.

This change implies that whenever user space *wrote* to a private mapping
(IOW, we have an anonymous page mapped), that GUP pins will always remain
consistent: reliable R/O GUP pins of anonymous pages.

As a side note, this commit fixes the COW security issue for hugetlb with
FOLL_PIN as documented in:
https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
instead of FOLL_PIN.

Note that follow_huge_pmd() doesn't apply because we cannot end up in
there with FOLL_PIN.

This commit is heavily based on prototype patches by Andrea.

Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8909691b 09-May-2022 David Hildenbrand <david@redhat.com>

mm/gup: disallow follow_page(FOLL_PIN)

We want to change the way we handle R/O pins on anonymous pages that might
be shared: if we detect a possibly shared anonymous page -- mapped R/O and
not !PageAnonExclusive() -- we want to trigger unsharing via a page fault,
resulting in an exclusive anonymous page that can be pinned reliably
without getting replaced via COW on the next write fault.

However, the required page fault will be problematic for follow_page(): in
contrast to ordinary GUP, follow_page() doesn't trigger faults internally.
So we would have to end up failing a R/O pin via follow_page(), although
there is something mapped R/O into the page table, which might be rather
surprising.

We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
internal MM function. Let's just make our life easier and the semantics
of follow_page() clearer by just disallowing FOLL_PIN for follow_page()
completely.

Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# da32b581 23-Apr-2022 Catalin Marinas <catalin.marinas@arm.com>

mm: Add fault_in_subpage_writeable() to probe at sub-page granularity

On hardware with features like arm64 MTE or SPARC ADI, an access fault
can be triggered at sub-page granularity. Depending on how the
fault_in_writeable() function is used, the caller can get into a
live-lock by continuously retrying the fault-in on an address different
from the one where the uaccess failed.

In the majority of cases progress is ensured by the following
conditions:

1. copy_to_user_nofault() guarantees at least one byte access if the
user address is not faulting.

2. The fault_in_writeable() loop is resumed from the first address that
could not be accessed by copy_to_user_nofault().

If the loop iteration is restarted from an earlier (initial) point, the
loop is repeated with the same conditions and it would live-lock.

Introduce an arch-specific probe_subpage_writeable() and call it from
the newly added fault_in_subpage_writeable() function. The arch code
with sub-page faults will have to implement the specific probing
functionality.

Note that no other fault_in_subpage_*() functions are added since they
have no callers currently susceptible to a live-lock.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20220423100751.1870771-2-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


# ece369c7 01-Apr-2022 Hugh Dickins <hughd@google.com>

mm/munlock: add lru_add_drain() to fix memcg_stat_test

Mike reports that LTP memcg_stat_test usually leads to

memcg_stat_test 3 TINFO: Test unevictable with MAP_LOCKED
memcg_stat_test 3 TINFO: Running memcg_process --mmap-lock1 -s 135168
memcg_stat_test 3 TINFO: Warming up pid: 3460
memcg_stat_test 3 TINFO: Process is still here after warm up: 3460
memcg_stat_test 3 TFAIL: unevictable is 122880, 135168 expected

but may also lead to

memcg_stat_test 4 TINFO: Test unevictable with mlock
memcg_stat_test 4 TINFO: Running memcg_process --mmap-lock2 -s 135168
memcg_stat_test 4 TINFO: Warming up pid: 4271
memcg_stat_test 4 TINFO: Process is still here after warm up: 4271
memcg_stat_test 4 TFAIL: unevictable is 122880, 135168 expected

or both. A wee bit flaky.

follow_page_pte() used to have an lru_add_drain() per each page mlocked,
and the test came to rely on accurate stats. The pagevec to be drained
is different now, but still covered by lru_add_drain(); and, never mind
the test, I believe it's in everyone's interest that a bulk faulting
interface like populate_vma_page_range() or faultin_vma_page_range()
should drain its local pagevecs at the end, to save others sometimes
needing the much more expensive lru_add_drain_all().

This does not absolutely guarantee exact stats - the mlocking task can
be migrated between CPUs as it proceeds - but it's good enough and the
tests pass.

Link: https://lkml.kernel.org/r/47f6d39c-a075-50cb-1cfb-26dd957a48af@google.com
Fixes: b67bf49ce7aa ("mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73fd16d8 22-Mar-2022 John Hubbard <jhubbard@nvidia.com>

mm/gup: remove unused get_user_pages_locked()

Now that the last caller of get_user_pages_locked() is gone, remove it.

Link: https://lkml.kernel.org/r/20220204020010.68930-6-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad6c4412 22-Mar-2022 John Hubbard <jhubbard@nvidia.com>

mm/gup: remove unused pin_user_pages_locked()

This routine was used for a short while, but then the calling code was
refactored and the only caller was removed.

Link: https://lkml.kernel.org/r/20220204020010.68930-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65462462 22-Mar-2022 John Hubbard <jhubbard@nvidia.com>

mm/gup: follow_pfn_pte(): -EEXIST cleanup

Remove a quirky special case from follow_pfn_pte(), and adjust its
callers to match. Caller changes include:

__get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and
its variants should handle PFN-only entries by stopping early, if the
caller expected **pages to be filled in. This makes for a more reliable
API, as compared to the previous approach of skipping over such entries
(and thus leaving them silently unwritten).

move_pages(): squash the -EEXIST error return from follow_page() into
-EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is
not.

Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Xu <peterx@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7196040e 22-Mar-2022 Peter Xu <peterx@redhat.com>

mm: fix invalid page pointer returned with FOLL_PIN gups

Patch series "mm/gup: some cleanups", v5.

This patch (of 5):

Alex reported invalid page pointer returned with pin_user_pages_remote()
from vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for
batched pinning with struct vfio_batch").

It turns out that it's not the fault of the vfio commit; however after
vfio switches to a full page buffer to store the page pointers it starts
to expose the problem easier.

The problem is for VM_PFNMAP vmas we should normally fail with an
-EFAULT then vfio will carry on to handle the MMIO regions. However
when the bug triggered, follow_page_mask() returned -EEXIST for such a
page, which will jump over the current page, leaving that entry in
**pages untouched. However the caller is not aware of it, hence the
caller will reference the page as usual even if the pointer data can be
anything.

We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP
handle pfn mapping unless FOLL_GET is requested") which seems very
reasonable. It could be that when we reworked GUP with FOLL_PIN we
could have overlooked that special path in commit 3faa52c03f44 ("mm/gup:
track FOLL_PIN pages"), even if that commit rightfully touched up
follow_devmap_pud() on checking FOLL_PIN when it needs to return an
-EEXIST.

Attaching the Fixes to the FOLL_PIN rework commit, as it happened later
than 1027e4436b6a.

[jhubbard@nvidia.com: added some tags, removed a reference to an out of tree module.]

Link: https://lkml.kernel.org/r/20220207062213.235127-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-2-jhubbard@nvidia.com
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Debugged-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b7f7e58 16-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert check_and_migrate_movable_pages() to use a folio

Switch from head pages to folios. This removes an assumption that
THPs are the only way to have a high-order page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 659508f9 23-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Turn compound_range_next() into gup_folio_range_next()

Convert the only caller to work on folios instead of pages.
This removes the last caller of put_compound_head(), so delete it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 12521c76 22-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Turn compound_next() into gup_folio_next()

Convert both callers to work on folios instead of pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 2d7919a2 22-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert gup_huge_pgd() to use a folio

Use the new folio-based APIs. This was the last user of
try_grab_compound_head(), so remove it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 83afb52e 22-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert gup_huge_pud() to use a folio

Use the new folio-based APIs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 667ed1f7 22-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert gup_huge_pmd() to use a folio

Use the new folio-based APIs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 09a1626e 22-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert gup_hugepte() to use a folio

There should be little to no effect from this patch; just removing
uses of some old APIs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# b0496fe4 10-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert gup_pte_range() to use a folio

We still call try_grab_folio() once per PTE; a future patch could
optimise to just adjust the reference count for each page within
the folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 822951d8 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()

follow_hugetlb_page() only cares about success or failure, so it doesn't
need to know the type of the returned pointer, only whether it's NULL
or not.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# d8ddc099 10-Dec-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Add gup_put_folio()

Convert put_compound_head() to gup_put_folio() and hpage_pincount_sub()
to folio_pincount_sub(). This removes the last call to put_page_refs(),
so delete it. Add a temporary put_compound_head() wrapper which will
be deleted by the end of this series.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 5fec0719 04-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Convert try_grab_page() to use a folio

Hoist the folio conversion and the folio_ref_count() check to the
top of the function instead of using the one buried in try_get_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# ece1ed7b 04-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Add try_get_folio() and try_grab_folio()

Convert try_get_compound_head() into try_get_folio() and convert
try_grab_compound_head() into try_grab_folio(). Add a temporary
try_grab_compound_head() wrapper around try_grab_folio() to let us
convert callers individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 5232c63f 06-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Make compound_pincount always available

Move compound_pincount from the third page to the second page, which
means it's available for all compound pages. That lets us delete
hpage_pincount_available().

On 32-bit systems, there isn't enough space for both compound_pincount
and compound_nr in the second page (it would collide with page->private,
which is in use for pages in the swap cache), so revert the optimisation
of storing both compound_order and compound_nr on 32-bit systems.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 6315d8a2 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Remove hpage_pincount_sub()

Move the assertion (and correct it to be a cheaper variant),
and inline the atomic_sub() operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 78d9d6ce 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Remove hpage_pincount_add()

It's clearer to call atomic_add() in the callers; the assertions clearly
can't fire there because they're part of the condition for calling
atomic_add().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 59409373 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Handle page split race more efficiently

If we hit the page split race, the current code returns NULL which will
presumably trigger a retry under the mmap_lock. This isn't necessary;
we can just retry the compound_head() lookup. This is a very minor
optimisation of an unlikely path, but conceptually it matches (eg)
the page cache RCU-protected lookup.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 4c654229 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Remove an assumption of a contiguous memmap

This assumption needs the inverse of nth_page(), which is temporarily
named page_nth() until it's renamed later in this series.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# c228afb1 07-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Fix some contiguous memmap assumptions

Several functions in gup.c assume that a compound page has virtually
contiguous page structs. This isn't true for SPARSEMEM configs unless
SPARSEMEM_VMEMMAP is also set. Fix them by using nth_page() instead of
plain pointer arithmetic.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 28297dbc 09-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Change the calling convention for compound_next()

Return the head page instead of storing it to a passed parameter.
Reorder the arguments to match the calling function's arguments.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 0b046e12 09-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Optimise compound_range_next()

By definition, a compound page has an order >= 1, so the second half
of the test was redundant. Also, this cannot be a tail page since
it's the result of calling compound_head(), so use PageHead() instead
of PageCompound().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 8f39f5fc 09-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Change the calling convention for compound_range_next()

Return the head page instead of storing it to a passed parameter.
Pass the start page directly instead of passing a pointer to it.
Reorder the arguments to match the calling function's arguments.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# e7602748 08-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Remove for_each_compound_head()

This macro doesn't simplify the users; it's easier to just call
compound_next() inside a standard loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# a5f100db 08-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Remove for_each_compound_range()

This macro doesn't simplify the users; it's easier to just call
compound_range_next() inside the loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 8ea2979c 04-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/gup: Increment the page refcount before the pincount

We should always increase the refcount before doing anything else to
the page so that other page users see the elevated refcount first.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f9f38f78 15-Feb-2022 Christoph Hellwig <hch@lst.de>

mm: refactor check_and_migrate_movable_pages

Remove up to two levels of indentation by using continue statements
and move variables to local scope where possible.

Link: https://lkml.kernel.org/r/20220210072828.2930359-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# b67bf49c 14-Feb-2022 Hugh Dickins <hughd@google.com>

mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to faultin page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right? I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before? Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area? I can see that being argued either
way, and have no reason to disagree with current behaviour.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# fe673d3f 08-Mar-2022 Linus Torvalds <torvalds@linux-foundation.org>

mm: gup: make fault_in_safe_writeable() use fixup_user_fault()

Instead of using GUP, make fault_in_safe_writeable() actually force a
'handle_mm_fault()' using the same fixup_user_fault() machinery that
futexes already use.

Using the GUP machinery meant that fault_in_safe_writeable() did not do
everything that a real fault would do, ranging from not auto-expanding
the stack segment, to not updating accessed or dirty flags in the page
tables (GUP sets those flags on the pages themselves).

The latter causes problems on architectures (like s390) that do accessed
bit handling in software, which meant that fault_in_safe_writeable()
didn't actually do all the fault handling it needed to, and trying to
access the user address afterwards would still cause faults.

Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable")
Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c36c04c2 01-Feb-2022 John Hubbard <jhubbard@nvidia.com>

Revert "mm/gup: small refactoring: simplify try_grab_page()"

This reverts commit 54d516b1d62ff8f17cee2da06e5e4706a0d00b8a

That commit did a refactoring that effectively combined fast and slow
gup paths (again). And that was again incorrect, for two reasons:

a) Fast gup and slow gup get reference counts on pages in different
ways and with different goals: see Linus' writeup in commit
cd1adf1b63a1 ("Revert "mm/gup: remove try_get_page(), call
try_get_compound_head() directly""), and

b) try_grab_compound_head() also has a specific check for
"FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
can fall back to slow gup. This resulted in new failures, as
recently report by Will McVicker [1].

But (a) has problems too, even though they may not have been reported
yet. So just revert this.

Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
Fixes: 54d516b1d62f ("mm/gup: small refactoring: simplify try_grab_page()")
Reported-and-tested-by: Will McVicker <willmcvicker@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Minchan Kim <minchan@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28b0ee3f 14-Jan-2022 Li Xinhai <lixinhai.lxh@gmail.com>

mm/gup.c: stricter check on THP migration entry during follow_pmd_mask

When BUG_ON check for THP migration entry, the existing code only check
thp_migration_supported case, but not for !thp_migration_supported case.
If !thp_migration_supported() and !pmd_present(), the original code may
dead loop in theory. To make the BUG_ON check consistent, we need catch
both cases.

Move the BUG_ON check one step earlier, because if the bug happen we
should know it instead of depend on FOLL_MIGRATION been used by caller.

Because pmdval instead of *pmd is read by the is_pmd_migration_entry()
check, the existing code don't help to avoid useless locking within
pmd_migration_entry_wait(), so remove that check.

Link: https://lkml.kernel.org/r/20211217062559.737063-1-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 677b2a8c 14-Jan-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

gup: avoid multiple user access locking/unlocking in fault_in_{read/write}able

fault_in_readable() and fault_in_writeable() perform __get_user() and
__put_user() in a loop, implying multiple user access locking/unlocking.

To avoid that, use user access blocks.

Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20b7fee7 05-Nov-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: further simplify __gup_device_huge()

Commit 6401c4eb57f9 ("mm: gup: fix potential pgmap refcnt leak in
__gup_device_huge()") simplified the return paths, but didn't go quite
far enough, as discussed in [1].

Remove the "ret" variable entirely, because there is enough information
already available to provide the return value.

[1] https://lore.kernel.org/r/CAHk-=wgQTRX=5SkCmS+zfmpqubGHGJvXX_HgnPG8JSpHKHBMeg@mail.gmail.com

Link: https://lkml.kernel.org/r/20210904004224.86391-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55b8fe70 17-Aug-2021 Andreas Gruenbacher <agruenba@redhat.com>

gup: Introduce FOLL_NOFAULT flag to disable page faults

Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
-EFAULT when it would otherwise trigger a page fault. This is roughly
similar to FOLL_FAST_ONLY but available on all architectures, and less
fragile.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# cdd591fc 05-Jul-2021 Andreas Gruenbacher <agruenba@redhat.com>

iov_iter: Introduce fault_in_iov_iter_writeable

Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing. Uses get_user_pages() to fault in the pages
without actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the iterator passed to .read_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# bb523b40 02-Aug-2021 Andreas Gruenbacher <agruenba@redhat.com>

gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.

Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# cd1adf1b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"

This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ac95884 02-Sep-2021 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate: enable returning precise migrate_pages() success count

Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.

Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9857a17f 02-Sep-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: remove try_get_page(), call try_get_compound_head() directly

try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54d516b1 02-Sep-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: small refactoring: simplify try_grab_page()

try_grab_page() does the same thing as try_grab_compound_head(..., refs=1,
...), just with a different API. So there is a lot of code duplication
there.

Change try_grab_page() to call try_grab_compound_head(), while keeping the
API contract identical for callers.

Also, now that try_grab_compound_head() always has a caller, remove the
__maybe_unused annotation.

Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3967db22 02-Sep-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: documentation corrections for gup/pup

Patch series "A few gup refactorings and documentation updates", v3.

While reviewing some of the other things going on around gup.c, I noticed
that the documentation was wrong for a few of the routines that I wrote.
And then I noticed that there was some significant code duplication too.
So this fixes those issues.

This is not entirely risk-free, but after looking closely at this, I think
it's actually a useful improvement, getting rid of the code duplication
here.

This patch (of 3):

The documentation for try_grab_compound_head() and try_grab_page() has
fallen a little out of date. Update and clarify a few points.

Also make it kerneldoc-correct, by adding @args documentation.

Link: https://lkml.kernel.org/r/20210813044133.1536842-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20210813044133.1536842-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be51eb18 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm: gup: use helper PAGE_ALIGNED in populate_vma_page_range()

Use helper PAGE_ALIGNED to check if address is aligned to PAGE_SIZE.
Minor readability improvement.

Link: https://lkml.kernel.org/r/20210807093620.21347-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6401c4eb 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm: gup: fix potential pgmap refcnt leak in __gup_device_huge()

When failed to try_grab_page, put_dev_pagemap() is missed. So pgmap
refcnt will leak in this case. Also we remove the check for pgmap against
NULL as it's also checked inside the put_dev_pagemap().

[akpm@linux-foundation.org: simplify, cleanup]
[akpm@linux-foundation.org: fix return value]

Link: https://lkml.kernel.org/r/20210807093620.21347-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06a9e696 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm: gup: remove useless BUG_ON in __get_user_pages()

Indeed, this BUG_ON couldn't catch anything useful. We are sure ret == 0
here because we would already bail out if ret != 0 and ret is untouched
till here.

Link: https://lkml.kernel.org/r/20210807093620.21347-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0fef147b 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm: gup: remove unneed local variable orig_refs

Remove unneed local variable orig_refs since refs is unchanged now.

Link: https://lkml.kernel.org/r/20210807093620.21347-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8fed2f3c 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm: gup: remove set but unused local variable major

Patch series "Cleanups and fixup for gup".

This series contains cleanups to remove unneeded variable, useless BUG_ON
and use helper to improve readability. Also we fix a potential pgmap
refcnt leak. More details can be found in the respective changelogs.

This patch (of 5):

Since commit a2beb5f1efed ("mm: clean up the last pieces of page fault
accountings"), the local variable major is unused. Remove it.

Link: https://lkml.kernel.org/r/20210807093620.21347-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210807093620.21347-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb2faa51 13-Aug-2021 David Hildenbrand <david@redhat.com>

mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)

Doing some extended tests and polishing the man page update for
MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
madvise() user error.

We want to report only problematic mappings and permission problems that
the user could have know as -EINVAL.

Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
but instead indicate -EFAULT to user space. While we could also convert
it to -ENOMEM, using -EFAULT looks more helpful when user space might
want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
not part of an final Linux release and we can still adjust the behavior.

Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1507f512 07-Jul-2021 Mike Rapoport <rppt@kernel.org>

mm: introduce memfd_secret system call to create "secret" memory areas

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly
enable it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

Secretmem is designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Seceretmem makes
"simple" ROP insufficient to perform exfiltration, which increases the
required complexity of the attack. Along with other protections like
the kernel stack size limit and address space layout randomization which
make finding gadgets is really hard, absence of any in-kernel primitive
for accessing secret memory means the one gadget ROP attack can't work.
Since the only way to access secret memory is to reconstruct the missing
mapping entry, the attacker has to recover the physical page and insert
a PTE pointing to it in the kernel and then retrieve the contents. That
takes at least three gadgets which is a level of difficulty beyond most
standard attacks.

* Prevent cross-process secret userspace memory exposures. Once the
secret memory is allocated, the user can't accidentally pass it into the
kernel to be transmitted somewhere. The secreremem pages cannot be
accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws. In order to access secretmem,
a kernel-side attack would need to either walk the page tables and
create new ones, or spawn a new privileged uiserspace process to perform
secrets exfiltration using ptrace.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File
descriptor approach allows explicit and controlled sharing of the memory
areas, it allows to seal the operations. Besides, file descriptor based
memory paves the way for VMMs to remove the secret memory range from the
userspace hipervisor process, for instance QEMU. Andy Lutomirski says:

"Getting fd-backed memory into a guest will take some possibly major
work in the kernel, but getting vma-backed memory into a guest without
mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because it's purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory. Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system calls than with a
multiplexer or an overloaded syscall. Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create() so there is no much sense in overloading memfd_create() to
begin with. If there will be a need for code sharing between these
implementation it can be easily achieved without a need to adjust user
visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would fail
and mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space. With default limits, there is no excessive use of
secretmem and it poses no real problem in combination with
ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

[akpm@linux-foundation.org: suppress Kconfig whine]

Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ca9b385 30-Jun-2021 David Hildenbrand <david@redhat.com>

mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables

I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does
not succeed because we are out of backend memory (which can happen easily
with file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings. In addition, we never
actually report errors during the final populate phase - it is best-effort
only.

fallocate() can be used to preallocate file-based memory and fail in a
safe way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get pagefaults
on first access - which is sometimes undesired (i.e., real-time workloads)
and requires real prefaulting of page tables, not just a preallocation of
backend storage. There might be interesting use cases for sparse memory
regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
as it does not prefault page tables.

II. On preallcoation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., reading+writing one byte to not
overwrite existing data) all individual pages.

However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
of hugetlbfs/shmem/file-backed memory. For example, this is
problematic in hypervisors like QEMU where SIGBUS handlers might already
be used by other subsystems concurrently to e.g, handle hardware errors.
"Simply" doing preallocation concurrently from other thread is not that
easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
"might be a good idea to read some pages" vs. "Definitely populate/
preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
don't want populate/prealloc semantics. They treat this rather as a hint
to give a little performance boost without too much overhead - and don't
expect that a lot of memory might get consumed or a lot of time
might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like
manually reading each individual page. This will not break any COW
mappings. The shared zero page might get mapped and no backend storage
might get preallocated -- allocation might be deferred to
write-fault time. Especially shared file mappings require an explicit
fallocate() upfront to actually preallocate backend memory (blocks in
the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
cannot protect from the OOM (Out Of Memory) handler killing the
process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time. while not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.

MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back. As discussed above, shared file mappings
might require an explciit fallocate() upfront to achieve
preallcoation+prepopulation.

Although sparse memory mappings are the primary use case, this will also
be useful for other preallocate/prefault use cases where MAP_POPULATE is
not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back than was performance improvements --
which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times. The results
correspond to the shortest execution time. In general, the performance
benefit for huge pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE is fastest. Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which result in short population times.

The fastest way to allocate backend storage (here: swap or huge pages) and
prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest, however, doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes.
POPULATE_READ is faster than simple reads.

Without a fd, the fastest way to allocate backend storage and prefault
page tables is POPULATE_WRITE. With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception are actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.

The fastest way to allocate backend storage prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.

v.3: Detailed results

==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 0.119 ms
Anon 4 KiB : Write : 0.222 ms
Anon 4 KiB : Read/Write : 0.380 ms
Anon 4 KiB : POPULATE_READ : 0.060 ms
Anon 4 KiB : POPULATE_WRITE : 0.158 ms
Memfd 4 KiB : Read : 0.034 ms
Memfd 4 KiB : Write : 0.310 ms
Memfd 4 KiB : Read/Write : 0.362 ms
Memfd 4 KiB : POPULATE_READ : 0.039 ms
Memfd 4 KiB : POPULATE_WRITE : 0.229 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.033 ms
tmpfs : Write : 0.313 ms
tmpfs : Read/Write : 0.406 ms
tmpfs : POPULATE_READ : 0.039 ms
tmpfs : POPULATE_WRITE : 0.285 ms
file : Read : 0.033 ms
file : Write : 0.351 ms
file : Read/Write : 0.408 ms
file : POPULATE_READ : 0.039 ms
file : POPULATE_WRITE : 0.290 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 237.940 ms
Anon 4 KiB : Write : 708.409 ms
Anon 4 KiB : Read/Write : 1054.041 ms
Anon 4 KiB : POPULATE_READ : 124.310 ms
Anon 4 KiB : POPULATE_WRITE : 572.582 ms
Memfd 4 KiB : Read : 136.928 ms
Memfd 4 KiB : Write : 963.898 ms
Memfd 4 KiB : Read/Write : 1106.561 ms
Memfd 4 KiB : POPULATE_READ : 78.450 ms
Memfd 4 KiB : POPULATE_WRITE : 805.881 ms
Memfd 2 MiB : Read : 357.116 ms
Memfd 2 MiB : Write : 357.210 ms
Memfd 2 MiB : Read/Write : 357.606 ms
Memfd 2 MiB : POPULATE_READ : 356.094 ms
Memfd 2 MiB : POPULATE_WRITE : 356.937 ms
tmpfs : Read : 137.536 ms
tmpfs : Write : 954.362 ms
tmpfs : Read/Write : 1105.954 ms
tmpfs : POPULATE_READ : 80.289 ms
tmpfs : POPULATE_WRITE : 822.826 ms
file : Read : 137.874 ms
file : Write : 987.025 ms
file : Read/Write : 1107.439 ms
file : POPULATE_READ : 80.413 ms
file : POPULATE_WRITE : 857.622 ms
hugetlbfs : Read : 355.607 ms
hugetlbfs : Write : 355.729 ms
hugetlbfs : Read/Write : 356.127 ms
hugetlbfs : POPULATE_READ : 354.585 ms
hugetlbfs : POPULATE_WRITE : 355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 0.394 ms
Anon 4 KiB : Write : 0.348 ms
Anon 4 KiB : Read/Write : 0.400 ms
Anon 4 KiB : POPULATE_READ : 0.326 ms
Anon 4 KiB : POPULATE_WRITE : 0.273 ms
Anon 2 MiB : Read : 0.030 ms
Anon 2 MiB : Write : 0.030 ms
Anon 2 MiB : Read/Write : 0.030 ms
Anon 2 MiB : POPULATE_READ : 0.030 ms
Anon 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 4 KiB : Read : 0.412 ms
Memfd 4 KiB : Write : 0.372 ms
Memfd 4 KiB : Read/Write : 0.419 ms
Memfd 4 KiB : POPULATE_READ : 0.343 ms
Memfd 4 KiB : POPULATE_WRITE : 0.288 ms
Memfd 4 KiB : FALLOCATE : 0.137 ms
Memfd 4 KiB : FALLOCATE+Read : 0.446 ms
Memfd 4 KiB : FALLOCATE+Write : 0.330 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 2 MiB : FALLOCATE : 0.030 ms
Memfd 2 MiB : FALLOCATE+Read : 0.031 ms
Memfd 2 MiB : FALLOCATE+Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.416 ms
tmpfs : Write : 0.369 ms
tmpfs : Read/Write : 0.425 ms
tmpfs : POPULATE_READ : 0.346 ms
tmpfs : POPULATE_WRITE : 0.295 ms
tmpfs : FALLOCATE : 0.139 ms
tmpfs : FALLOCATE+Read : 0.447 ms
tmpfs : FALLOCATE+Write : 0.333 ms
tmpfs : FALLOCATE+Read/Write : 0.454 ms
tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms
file : Read : 0.191 ms
file : Write : 0.511 ms
file : Read/Write : 0.524 ms
file : POPULATE_READ : 0.196 ms
file : POPULATE_WRITE : 0.434 ms
file : FALLOCATE : 0.004 ms
file : FALLOCATE+Read : 0.197 ms
file : FALLOCATE+Write : 0.554 ms
file : FALLOCATE+Read/Write : 0.480 ms
file : FALLOCATE+POPULATE_READ : 0.201 ms
file : FALLOCATE+POPULATE_WRITE : 0.381 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
hugetlbfs : FALLOCATE : 0.030 ms
hugetlbfs : FALLOCATE+Read : 0.031 ms
hugetlbfs : FALLOCATE+Write : 0.031 ms
hugetlbfs : FALLOCATE+Read/Write : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 1053.090 ms
Anon 4 KiB : Write : 913.642 ms
Anon 4 KiB : Read/Write : 1060.350 ms
Anon 4 KiB : POPULATE_READ : 893.691 ms
Anon 4 KiB : POPULATE_WRITE : 782.885 ms
Anon 2 MiB : Read : 358.553 ms
Anon 2 MiB : Write : 358.419 ms
Anon 2 MiB : Read/Write : 357.992 ms
Anon 2 MiB : POPULATE_READ : 357.533 ms
Anon 2 MiB : POPULATE_WRITE : 357.808 ms
Memfd 4 KiB : Read : 1078.144 ms
Memfd 4 KiB : Write : 942.036 ms
Memfd 4 KiB : Read/Write : 1100.391 ms
Memfd 4 KiB : POPULATE_READ : 925.829 ms
Memfd 4 KiB : POPULATE_WRITE : 804.394 ms
Memfd 4 KiB : FALLOCATE : 304.632 ms
Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms
Memfd 4 KiB : FALLOCATE+Write : 933.186 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms
Memfd 2 MiB : Read : 358.131 ms
Memfd 2 MiB : Write : 358.099 ms
Memfd 2 MiB : Read/Write : 358.250 ms
Memfd 2 MiB : POPULATE_READ : 357.563 ms
Memfd 2 MiB : POPULATE_WRITE : 357.334 ms
Memfd 2 MiB : FALLOCATE : 356.735 ms
Memfd 2 MiB : FALLOCATE+Read : 358.152 ms
Memfd 2 MiB : FALLOCATE+Write : 358.331 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms
tmpfs : Read : 1087.265 ms
tmpfs : Write : 950.840 ms
tmpfs : Read/Write : 1107.567 ms
tmpfs : POPULATE_READ : 922.605 ms
tmpfs : POPULATE_WRITE : 810.094 ms
tmpfs : FALLOCATE : 306.320 ms
tmpfs : FALLOCATE+Read : 1169.796 ms
tmpfs : FALLOCATE+Write : 933.730 ms
tmpfs : FALLOCATE+Read/Write : 1191.610 ms
tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms
file : Read : 654.101 ms
file : Write : 1259.142 ms
file : Read/Write : 1289.509 ms
file : POPULATE_READ : 661.642 ms
file : POPULATE_WRITE : 1106.816 ms
file : FALLOCATE : 1.864 ms
file : FALLOCATE+Read : 656.328 ms
file : FALLOCATE+Write : 1153.300 ms
file : FALLOCATE+Read/Write : 1180.613 ms
file : FALLOCATE+POPULATE_READ : 668.347 ms
file : FALLOCATE+POPULATE_WRITE : 996.143 ms
hugetlbfs : Read : 357.245 ms
hugetlbfs : Write : 357.413 ms
hugetlbfs : Read/Write : 357.120 ms
hugetlbfs : POPULATE_READ : 356.321 ms
hugetlbfs : POPULATE_WRITE : 356.693 ms
hugetlbfs : FALLOCATE : 355.927 ms
hugetlbfs : FALLOCATE+Read : 357.074 ms
hugetlbfs : FALLOCATE+Write : 357.120 ms
hugetlbfs : FALLOCATE+Read/Write : 356.983 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms
**************************************************

[1] https://lkml.org/lkml/2013/6/27/698

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a458b76a 28-Jun-2021 Andrea Arcangeli <aarcange@redhat.com>

mm: gup: pack has_pinned in MMF_HAS_PINNED

has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.

set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 292648ac 28-Jun-2021 Andrea Arcangeli <aarcange@redhat.com>

mm: gup: allow FOLL_PIN to scale in SMP

has_pinned cannot be written by each pin-fast or it won't scale in SMP.
This isn't "false sharing" strictly speaking (it's more like "true
non-sharing"), but it creates the same SMP scalability bottleneck of
"false sharing".

To verify the improvement, below test is done on 40 cpus host with
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (must be with
CONFIG_GUP_TEST=y):

$ sudo chrt -f 1 ./gup_test -a -m 512 -j 40

Where we can get (average value for 40 threads):

Old kernel: 477729.97 (+- 3.79%)
New kernel: 89144.65 (+-11.76%)

On a similar condition with 256 cpus, this commits increases the SMP
scalability of pin_user_pages_fast() executed by different threads of the
same process by more than 4000%.

[peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"]

Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c24d3732 28-Jun-2021 Jann Horn <jannh@google.com>

mm/gup: fix try_grab_compound_head() race with split_huge_page()

try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent freeing
of page tables (via local_irq_save()), but not against concurrent TLB
flushes, freeing of data pages, or splitting of compound pages.

Because no reference is held to the page when try_grab_compound_head() is
called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).

The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound page
may have been split up (either explicitly through split_huge_page() or by
freeing the compound page to the buddy allocator and then allocating its
individual order-0 pages). If that happens, get_user_pages_fast() may end
up returning the right page but lifting the refcount on a now-unrelated
page, leading to use-after-free of pages.

To fix it: Re-check whether the pages still belong together after lifting
the refcount on the head page. Move anything else that checks
compound_head(page) below the refcount increment.

This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
that PV TLB flushes are used instead of IPIs).

As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
a warning and bailout. Since the existing code only performed the BUG_ON
check on DEBUG_VM kernels, ensure that the new code also only performs the
check under that configuration - I don't want to mix two logically
separate changes together too much. The macro VM_WARN_ON_ONCE_PAGE()
doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
block. An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
definition for !DEBUG_VM such that it always returns false, but since that
would differ from the behavior of the normal WARN macros, it might be too
confusing for readers.

Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f10628d2 22-May-2021 Michal Hocko <mhocko@suse.com>

Revert "mm/gup: check page posion status for coredump."

While reviewing [1] I came across commit d3378e86d182 ("mm/gup: check
page posion status for coredump.") and noticed that this patch is broken
in two ways. First it doesn't really prevent hwpoison pages from being
dumped because hwpoison pages can be marked asynchornously at any time
after the check. Secondly, and more importantly, the patch introduces a
ref count leak because get_dump_page takes a reference on the page which
is not released.

It also seems that the patch was merged incorrectly because there were
follow up changes not included as well as discussions on how to address
the underlying problem [2]

Therefore revert the original patch.

Link: http://lkml.kernel.org/r/20210429122519.15183-4-david@redhat.com [1]
Link: http://lkml.kernel.org/r/57ac524c-b49a-99ec-c1e4-ef5027bfb61b@redhat.com [2]
Link: https://lkml.kernel.org/r/20210505135407.31590-1-mhocko@kernel.org
Fixes: d3378e86d182 ("mm/gup: check page posion status for coredump.")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0953a1b 06-May-2021 Ingo Molnar <mingo@kernel.org>

mm: fix typos in comments

Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f68749ec 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: longterm pin migration cleanup

When pages are longterm pinned, we must migrated them out of movable zone.
The function that migrates them has a hidden loop with goto. The loop is
to retry on isolation failures, and after successful migration.

Make this code better by moving this loop to the caller.

Link: https://lkml.kernel.org/r/20210215161349.246722-13-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 24dc20c7 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: change index type to long as it counts pages

In __get_user_pages_locked() i counts number of pages which should be
long, as long is used in all other places to contain number of pages, and
32-bit becomes increasingly small for handling page count proportional
values.

Link: https://lkml.kernel.org/r/20210215161349.246722-12-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1e153fe 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: migrate pinned pages out of movable zone

We should not pin pages in ZONE_MOVABLE. Currently, we do not pin only
movable CMA pages. Generalize the function that migrates CMA pages to
migrate all movable pages. Use is_pinnable_page() to check which pages
need to be migrated

Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a08ae36 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not
return pages that might belong to CMA region. This is currently used
for long term gup to make sure that such pins are not going to be done
on any CMA pages.

When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it
is focusing on CMA pages too much and that there is larger class of
pages that need the same treatment. MOVABLE zone cannot contain any
long term pins as well so it makes sense to reuse and redefine this flag
for that usecase as well. Rename the flag to PF_MEMALLOC_PIN which
defines an allocation context which can only get pages suitable for
long-term pins.

Also rename: memalloc_nocma_save()/memalloc_nocma_restore to
memalloc_pin_save()/memalloc_pin_restore() and make the new functions
common.

[rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org

Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e7f34eb 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: check for isolation errors

It is still possible that we pin movable CMA pages if there are
isolation errors and cma_page_list stays empty when we check again.

Check for isolation errors, and return success only when there are no
isolation errors, and cma_page_list is empty after checking.

Because isolation errors are transient, we retry indefinitely.

Link: https://lkml.kernel.org/r/20210215161349.246722-5-pasha.tatashin@soleen.com
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0f44638 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: return an error on migration failure

When migration failure occurs, we still pin pages, which means that we
may pin CMA movable pages which should never be the case.

Instead return an error without pinning pages when migration failure
happens.

No need to retry migrating, because migrate_pages() already retries 10
times.

Link: https://lkml.kernel.org/r/20210215161349.246722-4-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83c02c23 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: check every subpage of a compound page during isolation

When pages are isolated in check_and_migrate_movable_pages() we skip
compound number of pages at a time. However, as Jason noted, it is not
necessary correct that pages[i] corresponds to the pages that we
skipped. This is because it is possible that the addresses in this
range had split_huge_pmd()/split_huge_pud(), and these functions do not
update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end-up having pages which are not compound
size page aligned.

Link: https://lkml.kernel.org/r/20210215161349.246722-3-pasha.tatashin@soleen.com
Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c991ffef 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm/gup: don't pin migrated cma pages in movable zone

Patch series "prohibit pinning pages in ZONE_MOVABLE", v11.

When page is pinned it cannot be moved and its physical address stays
the same until pages is unpinned.

This is useful functionality to allows userland to implementation DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.

It uses the same scheme logic that is currently used by CMA, and extends
the functionality for all allocations.

For more information read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com

This patch (of 14):

In order not to fragment CMA the pinned pages are migrated. However, they
are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning is
allowed.

Link: https://lkml.kernel.org/r/20210215161349.246722-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20210215161349.246722-2-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4066c119 29-Apr-2021 Yang Shi <shy828301@gmail.com>

mm: gup: remove FOLL_SPLIT

Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of
FOLL_SPLIT") and commit ba925fa35057 ("s390/gmap: improve THP splitting")
FOLL_SPLIT has not been used anymore. Remove the dead code.

Link: https://lkml.kernel.org/r/20210330203900.9222-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 458a4f78 29-Apr-2021 Joao Martins <joao.m.martins@oracle.com>

mm/gup: add a range variant of unpin_user_pages_dirty_lock()

Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

To that end, define another iterator for_each_compound_range() that
operates in page ranges as opposed to page array.

For users (like RDMA mr_dereg) where each sg represents a contiguous set
of pages, we're able to more efficiently unpin pages without having to
supply an array of pages much of what happens today with
unpin_user_pages().

Link: https://lkml.kernel.org/r/20210212130843.13865-4-joao.m.martins@oracle.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31b912de 29-Apr-2021 Joao Martins <joao.m.martins@oracle.com>

mm/gup: decrement head page once for group of subpages

Rather than decrementing the head page refcount one by one, we walk the
page array and checking which belong to the same compound_head. Later on
we decrement the calculated amount of references in a single write to the
head page. To that end switch to for_each_compound_head() does most of
the work.

set_page_dirty() needs no adjustment as it's a nop for non-dirty head
pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and hugetlbfs:

- THP

gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size

gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Link: https://lkml.kernel.org/r/20210212130843.13865-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8745d7f6 29-Apr-2021 Joao Martins <joao.m.martins@oracle.com>

mm/gup: add compound page list iterator

Patch series "mm/gup: page unpining improvements", v4.

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpining), particularly:

1) Decrement the head page by @ntails and thus reducing a lot the
number of atomic operations per compound page. This is done by
comparing individual tail pages heads, and counting number of
consecutive tails on which they match heads and based on that update
head page refcount. Should have a visible improvement in all page
(un)pinners which use compound pages

2) Introducing a new API for unpinning page ranges (to avoid the trick
in the previous item and be based on math), and use that in RDMA
ib_mem_release (used for mr deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP
improves ~3x (through gup_test) and RDMA MR dereg improves ~4.5x with the
new API. See patches 2 and 4 for those.

This patch (of 4):

Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head that the current. This is going to be used by
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Link: https://lkml.kernel.org/r/20210212130843.13865-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20210212130843.13865-2-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3378e86 09-Apr-2021 Aili Yao <yaoaili@kingsoft.com>

mm/gup: check page posion status for coredump.

When we do coredump for user process signal, this may be an SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
resulted from ECC memory fail like SRAR or SRAO, we expect the memory
recovery work is finished correctly, then the get_dump_page() will not
return the error page as its process pte is set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid, for current code, we will return the poison page,
get it dumped, and then lead to system panic as its in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There maybe other scenario that is also better to check the posion status
and not to panic, so make a wrapper for this check, Thanks to David's
suggestion(<david@redhat.com>).

[akpm@linux-foundation.org: s/0/false/]
[yaoaili@kingsoft.com: is_page_poisoned() arg cannot be null, per Matthew]

Link: https://lkml.kernel.org/r/20210322115233.05e4e82a@alex-virtual-machine
Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0fa5bc40 24-Feb-2021 Joao Martins <joao.m.martins@oracle.com>

mm/hugetlb: grab head page refcount once for group of subpages

Patch series "mm/hugetlb: follow_hugetlb_page() improvements", v2.

While looking at ZONE_DEVICE struct page reuse particularly the last
patch[0], I found two possible improvements for follow_hugetlb_page()
which is solely used for get_user_pages()/pin_user_pages().

The first patch batches page refcount updates while the second tidies up
storing the subpages/vmas. Both together bring the cost of slow variant
of gup() cost from ~87.6k usecs to ~5.8k usecs.

libhugetlbfs tests seem to pass as well gup_test benchmarks with hugetlbfs
vmas.

This patch (of 2):

follow_hugetlb_page() once it locks the pmd/pud, checks all its N subpages
in a huge page and grabs a reference for each one. Similar to gup-fast,
have follow_hugetlb_page() grab the head page refcount only after counting
all its subpages that are part of the just faulted huge page.

Consequently we reduce the number of atomics necessary to pin said huge
page, which improves non-fast gup() considerably:

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~87.6k us -> ~12.8k us

Link: https://lkml.kernel.org/r/20210128182632.24562-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20210128182632.24562-2-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a00cda3f 14-Dec-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

mm: fix kernel-doc markups

Kernel-doc markups should use this format:
identifier - description

Fix some issues on mm files:

1) The definition for get_user_pages_locked() doesn't follow it. Also,
it expects a short descrpition at the header, followed by a long one,
after the parameters. Fix it.

2) Kernel-doc requires that a kernel-doc markup to be immediately below
the function prototype, as otherwise it will rename it. So, move
get_pfnblock_flags_mask() description to the right place.

3) Make invalidate_mapping_pagevec() to also follow the expected
kernel-doc format.

While here, fix a few minor English syntax issues, as suggested
by Matthew:
will used -> will be used
similar with -> similar to

Link: https://lkml.kernel.org/r/80e85dddc92d333bc2159ee8a2294921612e8745.1605521731.git.mchehab+huawei@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Suggested-by: Mattew Wilcox <willy@infradead.org> [English fixes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4509b42c 14-Dec-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: combine put_compound_head() and unpin_user_page()

These functions accomplish the same thing but have different
implementations.

unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page() which creates a risk that the page could have been
hot-uplugged from the system.

Fix this by using put_compound_head() as the only implementation.

__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.

Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.

Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: Joao Martins <joao.m.martins@oracle.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jane Chu <jane.chu@oracle.com>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Mike Kravetz <mike.kravetz@oracle.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Muchun Song <songmuchun@bytedance.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52650c8b 14-Dec-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: remove the vma allocation from gup_longterm_locked()

Long ago there wasn't a FOLL_LONGTERM flag so this DAX check was done by
post-processing the VMA list.

These days it is trivial to just check each VMA to see if it is DAX before
processing it inside __get_user_pages() and return failure if a DAX VMA is
encountered with FOLL_LONGTERM.

Removing the allocation of the VMA list is a significant speed up for many
call sites.

Add an IS_ENABLED to vma_is_fsdax so that code generation is unchanged
when DAX is compiled out.

Remove the dummy version of __gup_longterm_locked() as !CONFIG_CMA already
makes memalloc_nocma_save(), check_and_migrate_cma_pages(), and
memalloc_nocma_restore() into a NOP.

Link: https://lkml.kernel.org/r/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57efa1fe 14-Dec-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: prevent gup_fast from racing with COW during fork

Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork. This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.

However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:

CPU 0 CPU 1
get_user_pages_fast()
internal_get_user_pages_fast()
copy_page_range()
pte_alloc_map_lock()
copy_present_page()
atomic_read(has_pinned) == 0
page_maybe_dma_pinned() == false
atomic_set(has_pinned, 1);
gup_pgd_range()
gup_pte_range()
pte_t pte = gup_get_pte(ptep)
pte_access_permitted(pte)
try_grab_compound_head()
pte = pte_wrprotect(pte)
set_pte_at();
pte_unmap_unlock()
// GUP now returns with a write protected page

The first attempt to resolve this by using the write protect caused
problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid
early COW write protect games during fork()")

Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range(). If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.

Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.

[akpm@linux-foundation.org: coding style fixes]
Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c28b1fc7 14-Dec-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: reorganize internal_get_user_pages_fast()

Patch series "Add a seqcount between gup_fast and copy_page_range()", v4.

As discussed and suggested by Linus use a seqcount to close the small race
between gup_fast and copy_page_range().

Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
in this case and it doesn't trigger any lockdeps.

I was able to test it using two threads, one forking and the other using
ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
made the window large enough to reliably hit to test the logic.

This patch (of 2):

The next patch in this series makes the lockless flow a little more
complex, so move the entire block into a new function and remove a level
of indention. Tidy a bit of cruft:

- addr is always the same as start, so use start

- Use the modern check_add_overflow() for computing end = start + len

- nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
avoid shift overflow, make the variables unsigned long to avoid coding
casts in both places. nr_pinned was missing its cast

- The handling of ret and nr_pinned can be streamlined a bit

No functional change.

Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a4a06da 13-Nov-2020 Peter Zijlstra <peterz@infradead.org>

mm/gup: Provide gup_get_pte() more generic

In order to write another lockless page-table walker, we need
gup_get_pte() exposed. While doing that, rename it to
ptep_get_lockless() to match the existing ptep_get() naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126121121.036370527@infradead.org


# 96e1fac1 13-Nov-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: use unpin_user_pages() in __gup_longterm_locked()

When FOLL_PIN is passed to __get_user_pages() the page list must be put
back using unpin_user_pages() otherwise the page pin reference persists
in a corrupted state.

There are two places in the unwind of __gup_longterm_locked() that put
the pages back without checking. Normally on error this function would
return the partial page list making this the caller's responsibility,
but in these two cases the caller is not allowed to see these pages at
all.

Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Reported-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f3bfab5 15-Oct-2020 Jann Horn <jannh@google.com>

mm/gup: take mmap_lock in get_dump_page()

Properly take the mmap_lock before calling into the GUP code from
get_dump_page(); and play nice, allowing the GUP code to drop the
mmap_lock if it has to sleep.

As Linus pointed out, we don't actually need the VMA because
__get_user_pages() will flush the dcache for us if necessary.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f942eea 15-Oct-2020 Jann Horn <jannh@google.com>

binfmt_elf_fdpic: stop using dump_emit() on user pointers on !MMU

Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

At the moment, we have that rather ugly mmget_still_valid() helper to work
around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
take the mmap_sem while traversing the task's VMAs, and if anything (like
userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
at the moment we use mmget_still_valid() to bail out in any writers that
might be operating on a remote mm's VMAs.

With this series, I'm trying to get rid of the need for that as cleanly as
possible. ("cleanly" meaning "avoid holding the mmap_lock across
unbounded sleeps".)

Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
dumping code.

Patches 5 and 6 implement the main change: Instead of repeatedly accessing
the VMA list with sleeps in between, we snapshot it at the start with
proper locking, and then later we just use our copy of the VMA list. This
ensures that the kernel won't crash, that VMA metadata in the coredump is
consistent even in the presence of concurrent modifications, and that any
virtual addresses that aren't being concurrently modified have their
contents show up in the core dump properly.

The disadvantage of this approach is that we need a bit more memory during
core dumping for storing metadata about all VMAs.

At the end of the series, patch 7 removes the old workaround for this
issue (mmget_still_valid()).

I have tested:

- Creating a simple core dump on X86-64 still works.
- The created coredump on X86-64 opens in GDB and looks plausible.
- X86-64 core dumps contain the first page for executable mappings at
offset 0, and don't contain the first page for non-executable file
mappings or executable mappings at offset !=0.
- NOMMU 32-bit ARM can still generate plausible-looking core dumps
through the FDPIC implementation. (I can't test this with GDB because
GDB is missing some structure definition for nommu ARM, but I've
poked around in the hexdump and it looked decent.)

This patch (of 7):

dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.

One small change we have to make in get_dump_page() is to use
__get_user_pages_locked() instead of __get_user_pages(), since the latter
doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
just call __get_user_pages() for us.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 146608bb 13-Oct-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: protect unpin_user_pages() against npages==-ERRNO

As suggested by Dan Carpenter, fortify unpin_user_pages() just a bit,
against a typical caller mistake: check if the npages arg is really a
-ERRNO value, which would blow up the unpinning loop: WARN and return.

If this new WARN_ON() fires, then the system *might* be leaking pages (by
leaving them pinned), but probably not. More likely, gup/pup returned a
hard -ERRNO error to the caller, who erroneously passed it here.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 447f3e45 13-Oct-2020 Barry Song <song.bao.hua@hisilicon.com>

mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM

gup prohibits users from calling get_user_pages() with FOLL_PIN. But it
allows users to call get_user_pages() with FOLL_LONGTERM only. It seems
insensible.

Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
users from calling get_user_pages() with FOLL_LONGTERM while not with
FOLL_PIN.

mm/gup_benchmark.c used to be the only user who did this improperly.
But it has been fixed by moving to use pin_user_pages().

[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4d63c37 27-Sep-2020 Jason A. Donenfeld <Jason@zx2c4.com>

mm: do not rely on mm == current->mm in __get_user_pages_locked

It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:

pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked

Before, this would lead to an OOPS:

BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064

This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.

Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Harald Arnesen <harald@skogtun.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 008cfe44 25-Sep-2020 Peter Xu <peterx@redhat.com>

mm: Introduce mm_struct.has_pinned

(Commit message majorly collected from Jason Gunthorpe)

Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.

Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter. For now, make it atomic_t but use it as
a boolean for simplicity.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3f7b1bb 25-Sep-2020 Vasily Gorbik <gor@linux.ibm.com>

mm/gup: fix gup_fast with dynamic page table folding

Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
...
pudp = pud_offset(&p4d, addr);

This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be
iterated.

On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteratation reading "random" stack value
pud_t pud = READ_ONCE(*pudp);

// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

return 1;
}

This happens since s390 moved to common gup code with commit
d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code").

s390 tried to mimic static level folding by changing pXd_offset
primitives to always calculate top level page table offset in pgd_offset
and just return the value passed when pXd_offset has to act as folded.

What is crucial for gup_fast and what has been overlooked is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.

To fix the issue in addition to pXd values pass original pXdp pointers
down to gup_pXd_range functions. And introduce pXd_offset_lockless
helpers, which take an additional pXd entry value parameter. This has
already been discussed in

https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.2+]
Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a308c71b 21-Aug-2020 Peter Xu <peterx@redhat.com>

mm/gup: Remove enfornced COW mechanism

With the more strict (but greatly simplified) page reuse logic in
do_wp_page(), we can safely go back to the world where cow is not
enforced with writes.

This essentially reverts commit 17839856fd58 ("gup: document and work
around 'COW can break either way' issue"). There are some context
differences due to some changes later on around it:

2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

Some lines moved back and forth with those, but this revert patch should
have striped out and covered all the enforced cow bits anyways.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9fa2dd94 03-Sep-2020 Dave Hansen <dave.hansen@linux.intel.com>

mm: fix pin vs. gup mismatch with gate pages

Gate pages were missed when converting from get to pin_user_pages().
This can lead to refcount imbalances. This is reliably and quickly
reproducible running the x86 selftests when vsyscall=emulate is enabled
(the default). Fix by using try_grab_page() with appropriate flags
passed.

The long story:

Today, pin_user_pages() and get_user_pages() are similar interfaces for
manipulating page reference counts. However, "pins" use a "bias" value
and manipulate the actual reference count by 1024 instead of 1 used by
plain "gets".

That means that pin_user_pages() must be matched with unpin_user_pages()
and can't be mixed with a plain put_user_pages() or put_page().

Enter gate pages, like the vsyscall page. They are pages usually in the
kernel image, but which are mapped to userspace. Userspace is allowed
access to them, including interfaces using get/pin_user_pages(). The
refcount of these kernel pages is manipulated just like a normal user
page on the get/pin side so that the put/unpin side can work the same
for normal user pages or gate pages.

get_gate_page() uses try_get_page() which only bumps the refcount by
1, not 1024, even if called in the pin_user_pages() path. If someone
pins a gate page, this happens:

pin_user_pages()
get_gate_page()
try_get_page() // bump refcount +1
... some time later
unpin_user_pages()
page_ref_sub_and_test(page, 1024))

... and boom, we get a refcount off by 1023. This is reliably and
quickly reproducible running the x86 selftests when booted with
vsyscall=emulate (the default). The selftests use ptrace(), but I
suspect anything using pin_user_pages() on gate pages could hit this.

To fix it, simply use try_grab_page() instead of try_get_page(), and
pass 'gup_flags' in so that FOLL_PIN can be respected.

This bug traces back to the very beginning of the FOLL_PIN support in
commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), which showed up in
the 5.7 release.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c357848 14-Aug-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: replace hpage_nr_pages with thp_nr_pages

The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 64019a2e 11-Aug-2020 Peter Xu <peterx@redhat.com>

mm/gup: remove task_struct pointer for all gup code

After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2beb5f1 11-Aug-2020 Peter Xu <peterx@redhat.com>

mm: clean up the last pieces of page fault accountings

Here're the last pieces of page fault accounting that were still done
outside handle_mm_fault() where we still have regs==NULL when calling
handle_mm_fault():

arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault
arch/sparc/mm/fault_32.c: force_user_fault
arch/um/kernel/trap.c: handle_page_fault
mm/gup.c: faultin_page
fixup_user_fault
mm/hmm.c: hmm_vma_fault
mm/ksm.c: break_ksm

Some of them has the issue of duplicated accounting for page fault
retries. Some of them didn't do the accounting at all.

This patch cleans all these up by letting handle_mm_fault() to do per-task
page fault accounting even if regs==NULL (though we'll still skip the perf
event accountings). With that, we can safely remove all the outliers now.

There's another functional change in that now we account the page faults
to the caller of gup, rather than the task_struct that passed into the gup
code. More information of this can be found at [1].

After this patch, below things should never be touched again outside
handle_mm_fault():

- task_struct.[maj|min]_flt
- PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]

[1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bce617ed 11-Aug-2020 Peter Xu <peterx@redhat.com>

mm: do page fault accounting in handle_mm_fault

Patch series "mm: Page fault accounting cleanups", v5.

This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):

https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

What this series did:

- Correct page fault accounting: we do accounting for a page fault
(no matter whether it's from #PF handling, or gup, or anything else)
only with the one that completed the fault. For example, page fault
retries should not be counted in page fault counters. Same to the
perf events.

- Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
event is used in an adhoc way across different archs.

Case (1): for many archs it's done at the entry of a page fault
handler, so that it will also cover e.g. errornous faults.

Case (2): for some other archs, it is only accounted when the page
fault is resolved successfully.

Case (3): there're still quite some archs that have not enabled
this perf event.

Since this series will touch merely all the archs, we unify this
perf event to always follow case (1), which is the one that makes most
sense. And since we moved the accounting into handle_mm_fault, the
other two MAJ/MIN perf events are well taken care of naturally.

- Unify definition of "major faults": the definition of "major
fault" is slightly changed when used in accounting (not
VM_FAULT_MAJOR). More information in patch 1.

- Always account the page fault onto the one that triggered the page
fault. This does not matter much for #PF handlings, but mostly for
gup. More information on this in patch 25.

Patchset layout:

Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25: Cleanup GUP task_struct pointer since it's not needed any more

This patch (of 25):

This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault(). This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events. To
do this, the pt_regs pointer is passed into handle_mm_fault().

PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.

So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed03d924 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/gup: use a standard migration target allocation callback

There is a well-defined migration target allocation callback. Use it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bbe88753 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/hugetlb: make hugetlb migration callback CMA aware

new_non_cma_page() in gup.c requires to allocate the new page that is not
on the CMA area. new_non_cma_page() implements it by using allocation
scope APIs.

However, there is a work-around for hugetlb. Normal hugetlb page
allocation API for migration is alloc_huge_page_nodemask(). It consists
of two steps. First is dequeing from the pool. Second is, if there is no
available page on the queue, allocating by using the page allocator.

new_non_cma_page() can't use this API since first step (deque) isn't aware
of scope API to exclude CMA area. So, new_non_cma_page() exports hugetlb
internal function for the second step, alloc_migrate_huge_page(), to
global scope and uses it directly. This is suboptimal since hugetlb pages
on the queue cannot be utilized.

This patch tries to fix this situation by making the deque function on
hugetlb CMA aware. In the deque function, CMA memory is skipped if
PF_MEMALLOC_NOCMA flag is found.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41b4dc14 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/gup: restrict CMA region by using allocation scope API

We have well defined scope API to exclude CMA region. Use it rather than
manipulating gfp_mask manually. With this change, we can now restore
__GFP_MOVABLE for gfp_mask like as usual migration target allocation. It
would result in that the ZONE_MOVABLE is also searched by page allocator.
For hugetlb, gfp_mask is redefined since it has a regular allocation mask
filter for migration target. __GPF_NOWARN is added to hugetlb gfp_mask
filter since a new user for gfp_mask filter, gup, want to be silent when
allocation fails.

Note that this can be considered as a fix for the commit 9a4e9f3b2d73
("mm: update get_user_pages_longterm to migrate pages allocated from CMA
region"). However, "Fixes" tag isn't added here since it is just
suboptimal but it doesn't cause any problem.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a36f7f8 07-Aug-2020 Tang Yizhou <tangyizhou@huawei.com>

mm/gup.c: fix the comment of return value for populate_vma_page_range()

The return value of populate_vma_page_range() is consistent with
__get_user_pages(), and so is the function comment of return value.

Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: http://lkml.kernel.org/r/20200720034303.29920-1-tangyizhou@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 481e980a 14-Jun-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

mm: Allow arches to provide ptep_get()

Since commit 9e343b467c70 ("READ_ONCE: Enforce atomicity for
{READ,WRITE}_ONCE() memory accesses") it is not possible anymore to
use READ_ONCE() to access complex page table entries like the one
defined for powerpc 8xx with 16k size pages.

Define a ptep_get() helper that architectures can override instead
of performing a READ_ONCE() on the page table entry pointer.

Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/087fa12b6e920e32315136b998aa834f99242695.1592225558.git.christophe.leroy@csgroup.eu


# 55ca2263 14-Jun-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

mm/gup: Use huge_ptep_get() in gup_hugepte()

gup_hugepte() reads hugepage table entries, it can't read
them directly, huge_ptep_get() must be used.

Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ffc3714334c3bfaca6f13788ad039e8759ae413f.1592225558.git.christophe.leroy@csgroup.eu


# c1e8d7c6 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem comments

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e4e28c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem API comments

Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da1c55f1 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: rename mmap_sem to mmap_lock

Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
now go through the new mmap locking api. The mmap_lock is still
implemented as a rwsem, though this could change in the future.

[akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42fc5414 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()

Add new APIs to assert that mmap_sem is held.

Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8ed45c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e31cf2f4 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: don't include asm/pgtable.h if linux/mm.h is already included

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are remove with a simple loop:

for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a005645 07-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: documentation fix for pin_user_pages*() APIs

All of the pin_user_pages*() API calls will cause pages to be
dma-pinned. As such, they are all suitable for either DMA, RDMA, and/or
Direct IO.

The documentation should say so, but it was instead saying that three of
the API calls were only suitable for Direct IO. This was discovered
when a reviewer wondered why an API call that specifically recommended
against Case 2 (DMA/RDMA) was being used in a DMA situation [1].

Fix this by simply deleting those claims. The gup.c comments already
refer to the more extensive Documentation/core-api/pin_user_pages.rst,
which does have the correct guidance. So let's just write it once,
there.

[1] https://lore.kernel.org/r/20200529074658.GM30374@kadam

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200529084515.46259-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 420c2091 07-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: introduce pin_user_pages_locked()

Patch series "mm/gup: introduce pin_user_pages_locked(), use it in frame_vector.c", v2.

This adds yet one more pin_user_pages*() variant, and uses that to
convert mm/frame_vector.c.

With this, along with maybe 20 or 30 other recent patches in various
trees, we are close to having the relevant gup call sites
converted--with the notable exception of the bio/block layer.

This patch (of 2):

Introduce pin_user_pages_locked(), which is nearly identical to
get_user_pages_locked() except that it sets FOLL_PIN and rejects
FOLL_GET.

As with other pairs of get_user_pages*() and pin_user_pages() API calls,
it's prudent to assert that FOLL_PIN is *not* set in the
get_user_pages*() call, so add that as part of this.

[jhubbard@nvidia.com: v2]
Link: http://lkml.kernel.org/r/20200531234131.770697-2-jhubbard@nvidia.com

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200531234131.770697-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20200527223243.884385-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20200527223243.884385-2-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dadbb612 07-Jun-2020 Souptick Joarder <jrdr.linux@gmail.com>

mm/gup.c: convert to use get_user_{page|pages}_fast_only()

API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter. Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only(). This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org> [arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d3a36a4 03-Jun-2020 Michal Hocko <mhocko@suse.com>

mm, mempolicy: fix up gup usage in lookup_node

ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
has added a special casing for 0 return value because that was a possible
gup return value when interrupted by fatal signal. This has been fixed by
ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return -EINTR for
fatal signal") in the mean time so ba841078cd05 can be reverted.

This patch however doesn't go all the way to revert it because the check
for 0 is wrong and confusing here. Firstly it is inherently unsafe to
access the page when get_user_pages_locked returns 0 (aka no page
returned).

Fortunatelly this will not happen because get_user_pages_locked will not
return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
the case here. Document this potential error code in gup code while we
are at it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f81cd178 03-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: might_lock_read(mmap_sem) in get_user_pages_fast()

Instead of scattering these assertions across the drivers, do this
assertion inside the core of get_user_pages_fast*() functions. That also
includes pin_user_pages_fast*() routines.

Add a might_lock_read(mmap_sem) call to internal_get_user_pages_fast().

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Link: http://lkml.kernel.org/r/20200522010443.1290485-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 104acc32 03-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: introduce pin_user_pages_fast_only()

This is the FOLL_PIN equivalent of __get_user_pages_fast(), except with a
more descriptive name, and gup_flags instead of a boolean "write" in the
argument list.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://lkml.kernel.org/r/20200519002124.2025955-4-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 376a34ef 03-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: refactor and de-duplicate gup_fast() code

There were two nearly identical sets of code for gup_fast() style of
walking the page tables with interrupts disabled. This has lead to the
usual maintenance problems that arise from having duplicated code.

There is already a core internal routine in gup.c for gup_fast(), so just
enhance it very slightly: allow skipping the fall-back to "slow" (regular)
get_user_pages(), via the new FOLL_FAST_ONLY flag. Then, just call
internal_get_user_pages_fast() from __get_user_pages_fast(), and adjust
the API to match pre-existing API behavior.

There is a change in behavior from this refactoring: the nested form of
interrupt disabling is used in all gup_fast() variants now. That's
because there is only one place that interrupt disabling for page walking
is done, and so the safer form is required. This should, if anything,
eliminate possible (rare) bugs, because the non-nested form of enabling
interrupts was fragile at best.

[jhubbard@nvidia.com: fixup]
Link: http://lkml.kernel.org/r/20200521233841.1279742-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://lkml.kernel.org/r/20200519002124.2025955-3-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9e1f0580 03-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: move __get_user_pages_fast() down a few lines in gup.c

Patch series "mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()", v2.

In order to convert the drm/i915 driver from get_user_pages() to
pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was
required. That led to refactoring __get_user_pages_fast(), with the
following goals:

1) As above: provide a pin_user_pages*() routine for drm/i915 to call,
in place of __get_user_pages_fast(),

2) Get rid of the gup.c duplicate code for walking page tables with
interrupts disabled. This duplicate code is a minor maintenance
problem anyway.

3) Make it easy for an upcoming patch from Souptick, which aims to
convert __get_user_pages_fast() to use a gup_flags argument, instead
of a bool writeable arg. Also, if this series looks good, we can
ask Souptick to change the name as well, to whatever the consensus
is. My initial recommendation is: get_user_pages_fast_only(), to
match the new pin_user_pages_only().

This patch (of 4):

This is in order to avoid a forward declaration of
internal_get_user_pages_fast(), in the next patch.

This is code movement only--all generated code should be identical.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://lkml.kernel.org/r/20200522051931.54191-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20200519002124.2025955-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20200519002124.2025955-2-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 548b6a1e 01-Jun-2020 Miles Chen <miles.chen@mediatek.com>

mm/gup.c: further document vma_permits_fault()

Describe the caller's responsibilities when passing
FAULT_FLAG_ALLOW_RETRY.

Link: http://lkml.kernel.org/r/1586915606.5647.5.camel@mtkswgap22
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91429023 01-Jun-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: introduce pin_user_pages_unlocked

Introduce pin_user_pages_unlocked(), which is nearly identical to the
get_user_pages_unlocked() that it wraps, except that it sets FOLL_PIN
and rejects FOLL_GET.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: http://lkml.kernel.org/r/20200518012157.1178336-2-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# adc8cb40 01-Jun-2020 Souptick Joarder <jrdr.linux@gmail.com>

mm/gup.c: update the documentation

This patch is an attempt to update the documentation.

- Add/ remove extra * based on type of function static/global.

- Add description for functions and their input arguments.

[akpm@linux-foundation.org: s@/*@/**@]
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1588013630-4497-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17839856 27-May-2020 Linus Torvalds <torvalds@linux-foundation.org>

gup: document and work around "COW can break either way" issue

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

- __get_user_pages_fast(): no change. If you don't ask for a write,
you won't break COW. You'd better know what you're doing.

- get_user_pages_fast(): the fast-case "look it up in the page tables
without anything getting mmap_sem" now refuses to follow a read-only
page, since it might need COW breaking. Which happens in the slow
path - the fast path doesn't know if the memory might be COW or not.

- get_user_pages() (including the slow-path fallback for gup_fast()):
for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path. So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
could arguably be seen as a user-space issue.

You only get private COW mappings that could break either way in
situations where user space is doing cooperative things (ie fork()
before an execve() etc), but it _is_ surprising and very subtle, and
fork() is supposed to give you independent address spaces.

So let's treat this as a kernel issue and make the semantics of
get_user_pages() easier to understand. Note that obviously a true
shared mapping will still get a page that can change under us, so this
does _not_ mean that get_user_pages() somehow returns any "stable"
page ]

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 475f4dfc 13-May-2020 Peter Xu <peterx@redhat.com>

mm/gup: fix fixup_user_fault() on multiple retries

This part was overlooked when reworking the gup code on multiple
retries.

When we get the 2nd+ retry, we'll be with TRIED flag set. Current code
will bail out on the 2nd retry because the !TRIED check will fail so the
retry logic will be skipped. What's worse is that, it will also return
zero which errornously hints the caller that the page is faulted in
while it's not.

The !TRIED flag check seems to not be needed even before the mutliple
retries change because if we get a VM_FAULT_RETRY, it must be the 1st
retry, and we should not have TRIED set for that.

Fix it by removing the !TRIED check, at the meantime check against fatal
signals properly before the page fault so we can still properly respond
to the user killing the process during retries.

Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d180870d 20-Apr-2020 Michal Hocko <mhocko@suse.com>

mm, gup: return EINTR when gup is interrupted by fatal signals

EINTR is the usual error code which other killable interfaces return.
This is the case for the other fatal_signal_pending break out from the
same function. Make the code consistent.

ERESTARTSYS is also quite confusing because the signal is fatal and so
no restart will happen before returning to the userspace.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: http://lkml.kernel.org/r/20200409071133.31734-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72ef5e52 14-Apr-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

docs: fix broken references to text files

Several references got broken due to txt to ReST conversion.

Several of them can be automatically fixed with:

scripts/documentation-file-ref-check --fix

Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> # hwtracing/coresight/Kconfig
Reviewed-by: Paul E. McKenney <paulmck@kernel.org> # memory-barrier.txt
Acked-by: Alex Shi <alex.shi@linux.alibaba.com> # translations/zh_CN
Acked-by: Federico Vaga <federico.vaga@vaga.pv.it> # translations/it_IT
Acked-by: Marc Zyngier <maz@kernel.org> # kvm/arm64
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/6f919ddb83a33b5f2a63b6b5f0575737bb2b36aa.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# ae46d2aa 08-Apr-2020 Hillf Danton <hdanton@sina.com>

mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal

__get_user_pages_locked() will return 0 instead of -EINTR after commit
4426e945df588 ("mm/gup: allow VM_FAULT_RETRY for multiple times") which
added extra code to allow gup detect fatal signal faster.

Restore the original -EINTR behavior.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c7b6a566 07-Apr-2020 Peter Xu <peterx@redhat.com>

mm/gup: Mark lock taken only after a successful retake

It's definitely incorrect to mark the lock as taken even if
down_read_killable() failed.

This wass overlooked when we switched from down_read() to
down_read_killable() because down_read() won't fail while
down_read_killable() could.

Fixes: 71335f37c5e8 ("mm/gup: allow to react to fatal signals")
Reported-by: syzbot+a8c70b7f3579fc0587dc@syzkaller.appspotmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4a9bc58 06-Apr-2020 Joe Perches <joe@perches.com>

mm: use fallthrough;

Convert the various /* fallthrough */ comments to the pseudo-keyword
fallthrough;

Done via script:
https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9de4f22a 06-Apr-2020 Huang Ying <ying.huang@intel.com>

mm: code cleanup for MADV_FREE

Some comments for MADV_FREE is revised and added to help people understand
the MADV_FREE code, especially the page flag, PG_swapbacked. This makes
page_is_file_cache() isn't consistent with its comments. So the function
is renamed to page_is_file_lru() to make them consistent again. All these
are put in one patch as one logical change.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0137f16 06-Apr-2020 Anshuman Khandual <anshuman.khandual@arm.com>

mm/vma: replace all remaining open encodings with vma_is_anonymous()

This replaces all remaining open encodings with vma_is_anonymous().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1582520593-30704-5-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3122e80e 06-Apr-2020 Anshuman Khandual <anshuman.khandual@arm.com>

mm/vma: make vma_is_accessible() available for general use

Lets move vma_is_accessible() helper to include/linux/mm.h which makes it
available for general use. While here, this replaces all remaining open
encodings for VMA access check with vma_is_accessible().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71335f37 01-Apr-2020 Peter Xu <peterx@redhat.com>

mm/gup: allow to react to fatal signals

The existing gup code does not react to the fatal signals in many code
paths. For example, in one retry path of gup we're still using
down_read() rather than down_read_killable(). Also, when doing page
faults we don't pass in FAULT_FLAG_KILLABLE as well, which means that
within the faulting process we'll wait in non-killable way as well. These
were spotted by Linus during the code review of some other patches.

Let's allow the gup code to react to fatal signals to improve the
responsiveness of threads when during gup and being killed.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4426e945 01-Apr-2020 Peter Xu <peterx@redhat.com>

mm/gup: allow VM_FAULT_RETRY for multiple times

This is the gup counterpart of the change that allows the VM_FAULT_RETRY
to happen for more than once. One thing to mention is that we must check
the fatal signal here before retry because the GUP can be interrupted by
that, otherwise we can loop forever.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad415db8 01-Apr-2020 Peter Xu <peterx@redhat.com>

mm/gup: fix __get_user_pages() on fault retry of hugetlb

When follow_hugetlb_page() returns with *locked==0, it means we've got a
VM_FAULT_RETRY within the fauling process and we've released the mmap_sem.
When that happens, we should stop and bail out.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f6da934 01-Apr-2020 Peter Xu <peterx@redhat.com>

mm/gup: rename "nonblocking" to "locked" where proper

Patch series "mm: Page fault enhancements", v6.

This series contains cleanups and enhancements to current page fault
logic. The whole idea comes from the discussion between Andrea and Linus
on the bug reported by syzbot here:

https://lkml.org/lkml/2017/11/2/833

Basically it does two things:

(a) Allows the page fault logic to be more interactive on not only
SIGKILL, but also the rest of userspace signals, and,

(b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
than once.

For (a): with the changes we should be able to react faster when page
faults are working in parallel with userspace signals like SIGSTOP and
SIGCONT (and more), and with that we can remove the buggy part in
userfaultfd and benefit the whole page fault mechanism on faster signal
processing to reach the userspace.

For (b), we should be able to allow the page fault handler to loop for
even more than twice. Some context: for now since we have
FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the
same interrupt context, however never more than twice. This can be not
only a potential cleanup to remove this assumption since AFAIU the code
itself doesn't really have this twice-only limitation (though that should
be a protective approach in the past), at the same time it'll greatly
simplify future works like userfaultfd write-protect where it's possible
to retry for more than twice (please have a look at [1] below for a
possible user that might require the page fault to be handled for a third
time; if we can remove the retry limitation we can simply drop that patch
and those complexity).

This patch (of 16):

There's plenty of places around __get_user_pages() that has a parameter
"nonblocking" which does not really mean that "it won't block" (because it
can really block) but instead it shows whether the mmap_sem is released by
up_read() during the page fault handling mostly when VM_FAULT_RETRY is
returned.

We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there're still many places
that are using the "nonblocking" as name.

Renaming the places to "locked" where proper to better suite the
functionality of the variable. While at it, fixing up some of the
comments accordingly.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df3a0a21 01-Apr-2020 Pingfan Liu <kernelfans@gmail.com>

mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path

FOLL_LONGTERM is a special case of FOLL_PIN. It suggests a pin which is
going to be given to hardware and can't move. It would truncate CMA
permanently and should be excluded.

In gup slow path, where
__gup_longterm_locked->check_and_migrate_cma_pages() handles
FOLL_LONGTERM, but in fast path, there lacks such a check, which means a
possible leak of CMA page to longterm pinned.

Place a check in try_grab_compound_head() in the fast path to fix the
leak, and if FOLL_LONGTERM happens on CMA, it will fall back to slow path
to migrate the page.

Some note about the check: Huge page's subpages have the same migrate type
due to either allocation from a free_list[] or alloc_contig_range() with
param MIGRATE_MOVABLE. So it is enough to check on a single subpage by
is_migrate_cma_page(subpage)

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4628b063 01-Apr-2020 Pingfan Liu <kernelfans@gmail.com>

mm/gup: rename nr as nr_pinned in get_user_pages_fast()

To better reflect the held state of pages and make code self-explaining,
rename nr as nr_pinned.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f28d4363 01-Apr-2020 Claudio Imbrenda <imbrenda@linux.ibm.com>

mm/gup/writeback: add callbacks for inaccessible pages

With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.

While cpu accesses will trigger a fault that can be resolved, I/O accesses
will just fail. We need to add a callback into architecture code for
places that will do I/O, namely when writeback is started or when a page
reference is taken.

This is not only to enable paging, file backing etc, it is also necessary
to protect the host against a malicious user space. For example a bad
QEMU could simply start direct I/O on such protected memory. We do not
want userspace to be able to trigger I/O errors and thus the logic is
"whenever somebody accesses that page (gup) or does I/O, make sure that
this page can be accessed". When the guest tries to access that page we
will wait in the page fault handler for writeback to have finished and for
the page_ref to be the expected value.

On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling we can
tackle this when we know the details.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1970dc6f 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting

Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.

Add two new fields to /proc/vmstat:

nr_foll_pin_acquired
nr_foll_pin_released

These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().

In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.

Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.

Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47e29d32 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.

This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.

A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3faa52c0 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: track FOLL_PIN pages

Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:

bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

[jhubbard@nvidia.com: add kerneldoc]
Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94202f12 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: require FOLL_GET for get_user_pages_fast()

Internal to mm/gup.c, require that get_user_pages_fast() and
__get_user_pages_fast() identify themselves, by setting FOLL_GET. This is
required in order to be able to make decisions based on "FOLL_PIN, or
FOLL_GET, or both or neither are set", in upcoming patches.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b78d834 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: pass gup flags to two more routines

In preparation for an upcoming patch, send gup flags args to two more
routines: put_compound_head(), and undo_dev_pagemap().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86dfbed4 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: pass a flags arg to __gup_device_* functions

A subsequent patch requires access to gup flags, so pass the flags
argument through to the __gup_device_* functions.

Also placate checkpatch.pl by shortening a nearby line.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22bf29b6 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: split get_user_pages_remote() into two routines

Patch series "mm/gup: track FOLL_PIN pages", v6.

This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

FOLL_PIN support is now in the main linux tree. However, the patch to use
FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
suite failure that involved (I think) page refcount overflows when huge
pages were used.

This patch definitively solves that kind of overflow problem, by adding an
exact pincount, for compound pages (of order > 1), in the 3rd struct page
of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.

Other interesting changes:

* dump_page(): added one, or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1) and
fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)

In contrast to earlier approaches, the page tracking can be incrementally
applied to the kernel call sites that, until now, have been simply calling
get_user_pages() ("gup"). In other words, opt-in by changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()

============================================================
Future steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
The first probably needs to be bio/biovecs, because any filesystem
testing is too difficult without those in place.

* Change VFS and filesystems to respond appropriately when encountering
dma-pinned pages.

* Work with Ira and others to connect this all up with file system
leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/

[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/

[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/

[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

This patch (of 12):

An upcoming patch requires reusing the implementation of
get_user_pages_remote(). Split up get_user_pages_remote() into an outer
routine that checks flags, and an implementation routine that will be
reused. This makes subsequent changes much easier to understand.

There should be no change in behavior due to this patch.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff2e6d72 03-Feb-2020 Peter Zijlstra <peterz@infradead.org>

asm-generic/tlb: rename HAVE_RCU_TABLE_FREE

Towards a more consistent naming scheme.

[akpm@linux-foundation.org: fix sparc64 Kconfig]
Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1f6a7dd 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm, tree-wide: rename put_user_page*() to unpin_user_page*()

In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eddb1c22 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: introduce pin_user_pages*() and FOLL_PIN

Introduce pin_user_pages*() variations of get_user_pages*() calls, and
also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites are
converted to use the correct get_user_pages*() or pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
("DIO") or something with similar characteristics, should call a
get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
will:

* Start with "pin_user_pages" instead of "get_user_pages". That
makes it easy to find and audit the call sites.

* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
this documentation. (I've reworded it and expanded upon it.)

Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [Documentation]
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4000fdf 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: allow FOLL_FORCE for get_user_pages_fast()

Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c4237f8b 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals of
get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
to Dan Williams for helping clarify the DAX refactoring.

Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a707cdd5 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: move try_get_compound_head() to top, fix minor issues

An upcoming patch uses try_get_compound_head() more widely, so move it to
the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a43e9820 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: factor out duplicate code from four routines

Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.

Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()

Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've
been able to runtime test the core get_user_pages() and
pin_user_pages() and related routines, but not so much on several of
the call sites--but those are generally just a couple of lines
changed, each.

Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.

Runtime testing for the call sites so far is pretty light:

* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
passes.
* infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

This patch (of 22):

There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four times,
and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded by
get_page()/put_page().

Also, further simplify (slightly), by waiting until the the successful
end of each routine, to increment *nr.

Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be9d3045 30-Jan-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge

No functional change, just leverage the helper function to improve
readability as others.

Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15494520 30-Jan-2020 Qiujun Huang <hqjagain@gmail.com>

mm: fix gup_pud_range

sorry for not processing for a long time. I met it again.

patch v1 https://lkml.org/lkml/2019/9/20/656

do_machine_check()
do_memory_failure()
memory_failure()
hw_poison_user_mappings()
try_to_unmap()
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));

...and now we have a swap entry that indicates that the page entry
refers to a bad (and poisoned) page of memory, but gup_fast() at this
level of the page table was ignoring swap entries, and incorrectly
assuming that "!pxd_none() == valid and present".

And this was not just a poisoned page problem, but a generaly swap entry
problem. So, any swap entry type (device memory migration, numa
migration, or just regular swapping) could lead to the same problem.

Fix this by checking for pxd_present(), instead of pxd_none().

Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d2dfbe47 30-Nov-2019 Liu Xiang <liuxiang_1999@126.com>

mm/gup.c: fix comments of __get_user_pages() and get_user_pages_remote()

Fix comments of __get_user_pages() and get_user_pages_remote(), make
them more clear.

Link: http://lkml.kernel.org/r/1572443533-3118-1-git-send-email-liuxiang_1999@126.com
Signed-off-by: Liu Xiang <liuxiang_1999@126.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b96cc655 30-Nov-2019 zhong jiang <zhongjiang@huawei.com>

mm/gup.c: allow CMA migration to propagate errors back to caller

check_and_migrate_cma_pages() was recording the result of
__get_user_pages_locked() in an unsigned "nr_pages" variable.

Because __get_user_pages_locked() returns a signed value that can
include negative errno values, this had the effect of hiding errors.

Change check_and_migrate_cma_pages() implementation so that it uses a
signed variable instead, and propagates the results back to the caller
just as other gup internal functions do.

This was discovered with the help of unsigned_lesser_than_zero.cocci.

Link: http://lkml.kernel.org/r/1571671030-58029-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0cd22afd 18-Oct-2019 John Hubbard <jhubbard@nvidia.com>

mm/gup: fix a misnamed "write" argument, and a related bug

In several routines, the "flags" argument is incorrectly named "write".
Change it to "flags".

Also, in one place, the misnaming led to an actual bug:
"flags & FOLL_WRITE" is required, rather than just "flags".
(That problem was flagged by krobot, in v1 of this patch.)

Also, change the flags argument from int, to unsigned int.

You can see that this was a simple oversight, because the
calling code passes "flags" to the fifth argument:

gup_pgd_range():
...
if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
PGDIR_SHIFT, next, flags, pages, nr))

...which, until this patch, the callees referred to as "write".

Also, change two lines to avoid checkpatch line length
complaints, and another line to fix another oversight
that checkpatch called out: missing "int" on pdshift.

Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
Fixes: b798bec4741b ("mm/gup: change write parameter to flags in fast walk")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f9652594 25-Sep-2019 Andrey Konovalov <andreyknvl@google.com>

mm: untag user pointers in mm/gup.c

This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

mm/gup.c provides a kernel interface that accepts user addresses and
manipulates user pages directly (for example get_user_pages, that is used
by the futex syscall). Since a user can provided tagged addresses, we
need to handle this case.

Add untagging to gup.c functions that use user addresses for vma lookups.

Link: http://lkml.kernel.org/r/4731bddba3c938658c10ff4ed55cc01c60f4c8f8.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfe7b00d 23-Sep-2019 Song Liu <songliubraving@fb.com>

mm, thp: introduce FOLL_SPLIT_PMD

Introduce a new foll_flag: FOLL_SPLIT_PMD. As the name says
FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need to use regular pages, but
would switch back to huge page and huge pmd on. One of such example is
uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.

Link: http://lkml.kernel.org/r/20190815164525.1848545-4-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d15eb31 23-Sep-2019 Andrew Morton <akpm@linux-foundation.org>

mm/gup: add make_dirty arg to put_user_pages_dirty_lock()

[11~From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()

Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.

There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:

* The block/bio related changes (Jerome mostly wrote those, but I've had
to move stuff around extensively, and add a little code)

* mm/ changes

* other subsystem patches

* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's good to
get an early look at it.

This is part a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
introduce put_user_page*(), placeholder versions").

This patch (of 3):

Provide more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty(). This is based on the following:

1. Lots of call sites become simpler if a bool is passed into
put_user_page*(), instead of making the call site choose which
put_user_page*() variant to call.

2. Christoph Hellwig's observation that set_page_dirty_lock() is
usually correct, and set_page_dirty() is usually a bug, or at least
questionable, within a put_user_page*() calling chain.

This leads to the following API choices:

* put_user_pages_dirty_lock(page, npages, make_dirty)

* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.

[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8c6546b 23-Sep-2019 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: introduce compound_nr()

Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17596731 16-Jul-2019 Robin Murphy <robin.murphy@arm.com>

mm: introduce ARCH_HAS_PTE_DEVMAP

ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression than an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 790c7369 11-Jul-2019 Guenter Roeck <linux@roeck-us.net>

mm/gup.c: mark undo_dev_pagemap as __maybe_unused

Several mips builds generate the following build warning.

mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used

The function is declared unconditionally but only called from behind
various ifdefs. Mark it __maybe_unused.

Link: http://lkml.kernel.org/r/1562072523-22311-1-git-send-email-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5d1c39f 11-Jul-2019 Andy Lutomirski <luto@kernel.org>

mm/gup.c: remove some BUG_ONs from get_gate_page()

If we end up without a PGD or PUD entry backing the gate area, don't BUG
-- just fail gracefully.

It's not entirely implausible that this could happen some day on x86. It
doesn't right now even with an execute-only emulated vsyscall page because
the fixmap shares the PUD, but the core mm code shouldn't rely on that
particular detail to avoid OOPSing.

Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa712399 11-Jul-2019 Pingfan Liu <kernelfans@gmail.com>

mm/gup: speed up check_and_migrate_cma_pages() on huge page

Both hugetlb and thp locate on the same migration type of pageblock, since
they are allocated from a free_list[]. Based on this fact, it is enough
to check on a single subpage to decide the migration type of the whole
huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on
x86, similar on other archs.

Furthermore, when executing isolate_huge_page(), it avoid taking global
hugetlb_lock many times, and meanless remove/add to the local link list
cma_page_list.

[akpm@linux-foundation.org: make `i' and `step' unsigned]
Link: http://lkml.kernel.org/r/1561612545-28997-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 520b4a44 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: mark the page referenced in gup_hugepte

All other get_user_page_fast cases mark the page referenced, so do this
here as well.

Link: http://lkml.kernel.org/r/20190625143715.1689-17-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01a36916 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: switch gup_hugepte to use try_get_compound_head

This applies the overflow fixes from 8fde12ca79aff ("mm: prevent
get_user_pages() from overflowing page refcount") to the powerpc hugepd
code and brings it back in sync with the other GUP cases.

Link: http://lkml.kernel.org/r/20190625143715.1689-16-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cbd34da7 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: move the powerpc hugepd code to mm/gup.c

While only powerpc supports the hugepd case, the code is pretty generic
and I'd like to keep all GUP internals in one place.

Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 817be129 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: validate get_user_pages_fast flags

We can only deal with FOLL_WRITE and/or FOLL_LONGTERM in
get_user_pages_fast, so reject all other flags.

Link: http://lkml.kernel.org/r/20190625143715.1689-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 050a9adc 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: consolidate the get_user_pages* implementations

Always build mm/gup.c so that we don't have to provide separate nommu
stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
when HAVE_FAST_GUP into the main implementations, which will never call
the fast path if HAVE_FAST_GUP is not set.

This also ensures the new put_user_pages* helpers are available for nommu,
as those are currently missing, which would create a problem as soon as we
actually grew users for it.

Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3649f68 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: reorder code blocks in gup.c

This moves the actually exported functions towards the end of the file,
and reorders some functions to be in more logical blocks as a preparation
for moving various stubs inline into the main functionality using
IS_ENABLED().

Link: http://lkml.kernel.org/r/20190625143715.1689-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 67a929e0 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP

We only support the generic GUP now, so rename the config option to
be more clear, and always use the mm/Kconfig definition of the
symbol and select it from the arch Kconfigs.

Link: http://lkml.kernel.org/r/20190625143715.1689-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 39656e83 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: lift the x86_32 PAE version of gup_get_pte to common code

The split low/high access is the only non-READ_ONCE version of gup_get_pte
that did show up in the various arch implemenations. Lift it to common
code and drop the ifdef based arch override.

Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 26f4c328 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: simplify gup_fast_permitted

Pass in the already calculated end value instead of recomputing it, and
leave the end > start check in the callers instead of duplicating them in
the arch code.

Link: http://lkml.kernel.org/r/20190625143715.1689-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f455c854 11-Jul-2019 Christoph Hellwig <hch@lst.de>

mm: use untagged_addr() for get_user_pages_fast addresses

Patch series "switch the remaining architectures to use generic GUP", v4.

A series to switch mips, sh and sparc64 to use the generic GUP code so
that we only have one codebase to touch for further improvements to this
code.

This patch (of 16):

This will allow sparc64, or any future architecture with memory tagging to
override its tags for get_user_pages and get_user_pages_fast.

Link: http://lkml.kernel.org/r/20190625143715.1689-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David Miller <davem@davemloft.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7030aea 11-Jul-2019 Bharath Vedartham <linux.bhar@gmail.com>

mm/gup.c: make follow_page_mask() static

follow_page_mask() is only used in gup.c, make it static.

Link: http://lkml.kernel.org/r/20190510190831.GA4061@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25b2995a 13-Jun-2019 Christoph Hellwig <hch@lst.de>

mm: remove MEMORY_DEVICE_PUBLIC support

The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# df17277b 31-May-2019 Mike Rapoport <rppt@kernel.org>

mm/gup: continue VM_FAULT_RETRY processing even for pre-faults

When get_user_pages*() is called with pages = NULL, the processing of
VM_FAULT_RETRY terminates early without actually retrying to fault-in all
the pages.

If the pages in the requested range belong to a VMA that has userfaultfd
registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
has populated the page, but for the gup pre-fault case there's no actual
retry and the caller will get no pages although they are present.

This issue was uncovered when running post-copy memory restore in CRIU
after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails").

After this change, the copying of FPU state to the sigframe switched from
copy_to_user() variants which caused a real page fault to get_user_pages()
with pages parameter set to NULL.

In post-copy mode of CRIU, the destination memory is managed with
userfaultfd and lack of the retry for pre-fault case in get_user_pages()
causes a crash of the restored process.

Making the pre-fault behavior of get_user_pages() the same as the "normal"
one fixes the issue.

Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940]
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fc1d8e7c 13-May-2019 John Hubbard <jhubbard@nvidia.com>

mm: introduce put_user_page*(), placeholder versions

A discussion of the overall problem is below.

As mentioned in patch 0001, the steps are to fix the problem are:

1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.

Overview
========

Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time,
the API to achieve that was get_user_pages ("GUP") and its variations.
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.

GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
code to get the struct page behind a virtual address and to let storage
hardware perform a direct copy to or from that page. This is a
short-lived access pattern, and as such, the window for a concurrent
writeback of GUP'd page was small enough that there were not (we think)
any reported problems. Also, userspace was expected to understand and
accept that Direct IO was not synchronized with memory-mapped access to
that data, nor with any process address space changes such as munmap(),
mremap(), etc.

Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning
makes an underlying design problem more obvious.

In fact, there are a number of key problems inherent to GUP:

Interactions with file systems
==============================

File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:

kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})

Dave Chinner's description of this is very clear:

"The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.

Long term GUP
=============

Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can
lead to filesystem corruption. What happens is that when a file-backed
page is being written back, it is first mapped read-only in all of the CPU
page tables; the file system then assumes that nobody can write to the
page, and that the page content is therefore stable. Unfortunately, the
GUP callers generally do not monitor changes to the CPU pages tables; they
instead assume that the following pattern is safe (it's not):

get_user_pages()

Hardware can keep a reference to those pages for a very long time,
and write to it at any time. Because "hardware" here means "devices
that are not a CPU", this activity occurs without any interaction with
the kernel's file system code.

for each page
set_page_dirty
put_page()

In fact, the GUP documentation even recommends that pattern.

Anyway, the file system assumes that the page is stable (nothing is
writing to the page), and that is a problem: stable page content is
necessary for many filesystem actions during writeback, such as checksum,
encryption, RAID striping, etc. Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page for
as memory for that memory range inside the file.

Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a
bounce page to write stable data to the filesystem. The filesystem would
work on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but
note that it is only part of the overall solution, because other problems
remain.

Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be
to require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.

Direct IO
=========

Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls GUP
before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to do anything with
the page contents. However, it's still a real problem. The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages. That way, GUP will
properly synchronize with the active write back. This does not change the
required GUP behavior, it just avoids that race.

Details
=======

Introduces put_user_page(), which simply calls put_page(). This provides
a way to update all get_user_pages*() callers, so that they call
put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations, as a
replacement for release_pages(), and also as a replacement for open-coded
loops that release multiple pages. These may be used for subsequent
performance improvements, via batching of pages to be released.

This is the first step of fixing a problem (also described in [1] and [2])
with interactions between get_user_pages ("gup") and filesystems.

Problem description: let's start with a bug report. Below, is what
happens sometimes, under memory pressure, when a driver pins some pages
via gup, and then marks those pages dirty, and releases them. Note that
the gup documentation actually recommends that pattern. The problem is
that the filesystem may do a writeback while the pages were gup-pinned,
and then the filesystem believes that the pages are clean. So, when the
driver later marks the pages as dirty, that conflicts with the
filesystem's page tracking and results in a BUG(), like this one that I
experienced:

kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})

Dave Chinner's description of this is very clear:

"The fundamental issue is that ->page_mkwrite must be called on
every write access to a clean file backed page, not just the first
one. How long the GUP reference lasts is irrelevant, if the page is
clean and you need to dirty it, you must call ->page_mkwrite before it
is marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.

The steps are to fix it are:

1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.

[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs]
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7af75561 13-May-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: add FOLL_LONGTERM capability to GUP fast

DAX pages were previously unprotected from longterm pins when users called
get_user_pages_fast().

Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall back to
regular GUP processing if a DEVMAP page is encountered.

[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-5-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73b0140b 13-May-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: change GUP fast to use flags rather than a write 'bool'

To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality. New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted. This breaks the current
GUP call site convention of having the returned pages be the final
parameter. So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b798bec4 13-May-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: change write parameter to flags in fast walk

In order to support more options in the GUP fast walk, change the write
parameter to flags throughout the call stack.

This patch does not change functionality and passes FOLL_WRITE where write
was previously used.

Link: http://lkml.kernel.org/r/20190328084422.29911-3-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-3-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 932f4a63 13-May-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM

Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8fde12ca 11-Apr-2019 Linus Torvalds <torvalds@linux-foundation.org>

mm: prevent get_user_pages() from overflowing page refcount

If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.

Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a4e9f3b 05-Mar-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm: update get_user_pages_longterm to migrate pages allocated from CMA region

This patch updates get_user_pages_longterm to migrate pages allocated
out of CMA region. This makes sure that we don't keep non-movable pages
(due to page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in
the CMA region. ppc64 uses CMA region for allocation of the hardware
page table (hash page table) and not able to migrate pages out of CMA
region results in page table allocation failures.

One case where we hit this easy is when a guest using a VFIO passthrough
device. VFIO locks all the guest's memory and if the guest memory is
backed by CMA region, it becomes unmovable resulting in fragmenting the
CMA and possibly preventing other guests from allocation a large enough
hash page table.

NOTE: We allocate the new page without using __GFP_THISNODE

Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 414fd080 12-Feb-2019 Yu Zhao <yuzhao@google.com>

mm/gup: fix gup_pmd_range() for dax

For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true
on x86. So the function works as long as hugetlb is configured.
However, dax doesn't depend on hugetlb.

Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad8cfb9c 10-Feb-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: Remove the 'write' parameter from gup_fast_permitted()

The 'write' parameter is unused in gup_fast_permitted() so remove it.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20190210223424.13934-1-ira.weiny@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fa45f116 03-Jan-2019 Davidlohr Bueso <dave@stgolabs.net>

mm/: remove caller signal_pending branch predictions

This is already done for us internally by the signal machinery.

Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96d4f267 03-Jan-2019 Linus Torvalds <torvalds@linux-foundation.org>

Remove 'type' argument from access_ok() function

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

- csky still had the old "verify_area()" name as an alias.

- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)

- microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 08be37b7 30-Nov-2018 John Hubbard <jhubbard@nvidia.com>

mm/gup: finish consolidating error handling

Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
attempted to operate on each page that get_user_pages had retrieved. In
order to do that, it created a common exit point from the routine.
However, one case was missed, which this patch fixes up.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this patch removes.

Keith's description of the situation is:

This also fixes a potentially leaked dev_pagemap reference count if a
failure occurs when an iteration crosses a vma boundary. I don't think
it's normal to have different vma's on a users mapped zone device
memory, but good to fix anyway.

I actually thought that this code:

/* first iteration or cross vma bound */
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
goto out;

dealt with the "you're trying to pin the gate page, as part of this
call", rather than the generic case of crossing a vma boundary. (I
think there's a fine point that I must be overlooking.) But it's still a
valid case, either way.

Link: http://lkml.kernel.org/r/20181121081402.29641-2-jhubbard@nvidia.com
Fixes: df06b37ffe5a4 ("mm/gup: cache dev_pagemap while pinning pages")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78179556 16-Nov-2018 Mike Rapoport <rppt@kernel.org>

mm/gup.c: fix follow_page_mask() kerneldoc comment

Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.

Update the description to make the code and comments agree again.

While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.

Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2ebe8228 30-Oct-2018 Fengguang Wu <fengguang.wu@intel.com>

mm/gup.c: fix __get_user_pages_fast() comment

mmu_gather_tlb() no longer exists. Replace with mmu_table_batch().

Link: http://lkml.kernel.org/r/20180928053441.rpzwafzlsnp74mkl@wfg-t540p.sh.intel.com
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df06b37f 26-Oct-2018 Keith Busch <kbusch@kernel.org>

mm/gup: cache dev_pagemap while pinning pages

Getting pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata. This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.

This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap while getting user
pages. The gup_benchmark kernel self test reports this reduces time to
get user pages to as low as 1/3 of the previous time.

Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d4faa402 26-Oct-2018 Wei Yang <richard.weiyang@gmail.com>

mm: remove unnecessary local variable addr in __get_user_pages_fast()

The local variable `addr' in __get_user_pages_fast() is just a shadow of
`start'. Since `start' never changes after assignment to `addr', it is
fine to replace `start' with it.

Also the meaning of [start, end] is more obvious than [addr, end] when
passed to gup_pgd_range().

Link: http://lkml.kernel.org/r/20180925021448.20265-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b740303 23-Aug-2018 Souptick Joarder <jrdr.linux@gmail.com>

mm: Change return type int to vm_fault_t for fault handlers

Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb177a73 13-Jul-2018 Michal Hocko <mhocko@suse.com>

mm: do not bug_on on incorrect length in __mm_populate()

syzbot has noticed that a specially crafted library can easily hit
VM_BUG_ON in __mm_populate

kernel BUG at mm/gup.c:1242!
invalid opcode: 0000 [#1] SMP
CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
RIP: 0010:__mm_populate+0x1e2/0x1f0
Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
Call Trace:
vm_brk_flags+0xc3/0x100
vm_brk+0x1f/0x30
load_elf_library+0x281/0x2e0
__ia32_sys_uselib+0x170/0x1e0
do_fast_syscall_32+0xca/0x420
entry_SYSENTER_compat+0x70/0x7f

The reason is that the length of the new brk is not page aligned when we
try to populate the it. There is no reason to bug on that though.
do_brk_flags already aligns the length properly so the mapping is
expanded as it should. All we need is to tell mm_populate about it.
Besides that there is absolutely no reason to to bug_on in the first
place. The worst thing that could happen is that the last page wouldn't
get populated and that is far from putting system into an inconsistent
state.

Fix the issue by moving the length sanitization code from do_brk_flags
up to vm_brk_flags. The only other caller of do_brk_flags is brk
syscall entry and it makes sure to provide the proper length so t here
is no need for sanitation and so we can use do_brk_flags without it.

Also remove the bogus BUG_ONs.

[osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68827280 07-Jun-2018 Huang Ying <ying.huang@intel.com>

mm, gup: prevent pmd checking race in follow_pmd_mask()

mmap_sem will be read locked when calling follow_pmd_mask(). But this
cannot prevent PMD from being changed for all cases when PTL is
unlocked, for example, from pmd_trans_huge() to pmd_none() via
MADV_DONTNEED. So it is possible for the pmd_present() check in
follow_pmd_mask() to encounter an invalid PMD. This may cause an
incorrect VM_BUG_ON() or an infinite loop. Fix this by reading the PMD
entry into a local variable with READ_ONCE() and checking the local
variable and pmd_none() in the retry loop.

As Kirill pointed out, with PTL unlocked, the *pmd may be changed under
us, so reading it directly again and again may incur weird bugs. So
although using *pmd directly other than for pmd_present() checking may
be safe, it is still better to replace them to read *pmd once and check
the local variable multiple times.

When PTL unlocked, replace all *pmd with local variable was suggested by
Kirill.

Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3010a5ea 07-Jun-2018 Laurent Dufour <ldufour@linux.vnet.ibm.com>

mm: introduce ARCH_HAS_PTE_SPECIAL

Currently the PTE special supports is turned on in per architecture
header files. Most of the time, it is defined in
arch/*/include/asm/pgtable.h depending or not on some other per
architecture static definition.

This patch introduce a new configuration variable to manage this
directly in the Kconfig files. It would later replace
__HAVE_ARCH_PTE_SPECIAL.

Here notes for some architecture where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:

arm
__HAVE_ARCH_PTE_SPECIAL which is currently defined in
arch/arm/include/asm/pgtable-3level.h which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
- arch/powerpc/include/asm/book3s/64/pgtable.h
- arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.

sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__) which are defined through the compiler in
sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64

There is no functional change introduced by this patch.

Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Suggested-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <albert@sifive.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christophe LEROY <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9b6de77 19-Apr-2018 Dan Williams <dan.j.williams@intel.com>

mm: fix __gup_device_huge vs unmap

get_user_pages_fast() for device pages is missing the typical validation
that all page references have been taken while the mapping was valid.
Without this validation truncate operations can not reliably coordinate
against new page reference events like O_DIRECT.

Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 7f7ccc2c 11-May-2018 Willy Tarreau <w@1wt.eu>

proc: do not access cmdline nor environ from file-backed areas

proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.

Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.

This was assigned CVE-2018-1120.

Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0811078 13-Apr-2018 Michael S. Tsirkin <mst@redhat.com>

mm/gup.c: document return value

__get_user_pages_fast handles errors differently from
get_user_pages_fast: the former always returns the number of pages
pinned, the later might return a negative error code.

Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c61611f7 13-Apr-2018 Michael S. Tsirkin <mst@redhat.com>

get_user_pages_fast(): return -EFAULT on access_ok failure

get_user_pages_fast is supposed to be a faster drop-in equivalent of
get_user_pages. As such, callers expect it to return a negative return
code when passed an invalid address, and never expect it to return 0
when passed a positive number of pages, since its documentation says:

* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno.

When get_user_pages_fast fall back on get_user_pages this is exactly
what happens. Unfortunately the implementation is inconsistent: it
returns 0 if passed a kernel address, confusing callers: for example,
the following is pretty common but does not appear to do the right thing
with a kernel address:

ret = get_user_pages_fast(addr, 1, writeable, &page);
if (ret < 0)
return ret;

Change get_user_pages_fast to return -EFAULT when supplied a kernel
address to make it match expectations.

All callers have been audited for consistency with the documented
semantics.

Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2923117b 05-Apr-2018 Mario Leinweber <marioleinweber@web.de>

mm/gup.c: fix coding style issues.

- Fixed style error: 8 spaces -> 1 tab.
- Fixed style warning: Corrected misleading indentation.

Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96312e61 09-Mar-2018 Andrea Arcangeli <aarcange@redhat.com>

mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT

KVM is hanging during postcopy live migration with userfaultfd because
get_user_pages_unlocked is not capable to handle FOLL_NOWAIT.

Earlier FOLL_NOWAIT was only ever passed to get_user_pages.

Specifically faultin_page (the callee of get_user_pages_unlocked caller)
doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page fault
flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually
released (even if nonblocking is not NULL). So it sets *nonblocking to
zero and the caller won't release the mmap_sem thinking it was already
released, but it wasn't because of FOLL_NOWAIT.

Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 832d7aa0 29-Dec-2017 Christoph Hellwig <hch@lst.de>

mm: optimize dev_pagemap reference counting around get_dev_pagemap

Change the calling convention so that get_dev_pagemap always consumes the
previous reference instead of doing this using an explicit earlier call to
put_dev_pagemap in the callers.

The callers will still need to put the final reference after finishing the
loop over the pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# f6f37321 15-Dec-2017 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"

This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90.

We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.

Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was. Dave
Hansen says:

"So, the underlying bug here is that we now a get_user_pages_remote()
and then go ahead and do the p*_access_permitted() checks against the
current PKRU. This was introduced recently with the addition of the
new p??_access_permitted() calls.

We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them. This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*. I also didn't
explicitly test it against something where a COW needed to be done"

It's also not entirely clear that it makes sense to check the protection
key bits at this level at all. But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.

We'll see.

Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back. Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e716712f 19-Nov-2017 Al Viro <viro@zeniv.linux.org.uk>

__get_user_pages_locked(): get rid of notify_drop argument

The only caller that doesn't pass true in it is get_user_pages() and
it passes NULL in locked. The only place where we check it is
if (notify_locked && lock_dropped && *locked)
and lock_dropped can become true only if we have locked != NULL.
In other words, the second part of condition will be false when
called by get_user_pages().

Just get rid of the argument and turn the condition into
if (lock_dropped && *locked)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 14cb138d 19-Nov-2017 Al Viro <viro@zeniv.linux.org.uk>

get_user_pages_unlocked(): pass true to __get_user_pages_locked() notify_drop

Equivalent transformation - the only place in __get_user_pages_locked()
where we look at notify_drop argument is
if (notify_drop && lock_dropped && *locked) {
up_read(&mm->mmap_sem);
*locked = 0;
}
in the very end. Changing notify_drop from false to true won't change
behaviour unless *locked is non-zero. The caller is
ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
&locked, false, gup_flags | FOLL_TOUCH);
if (locked)
up_read(&mm->mmap_sem);
so in that case the original kernel would have done up_read() right after
return from __get_user_pages_locked(), while the modified one would've done
it right before the return.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c803c9c6 18-Nov-2017 Al Viro <viro@zeniv.linux.org.uk>

fold __get_user_pages_unlocked() into its sole remaining caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2bb6d283 29-Nov-2017 Dan Williams <dan.j.williams@intel.com>

mm: introduce get_user_pages_longterm

Patch series "introduce get_user_pages_longterm()", v2.

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy
change implies is one of the reasons we maintain the "dax enabled.
Warning: EXPERIMENTAL, use at your own risk" notification when mounting
a filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

This patch (of 4):

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas. Device-dax vmas do not have this problem and are
explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and
V4L2).

[akpm@linux-foundation.org: use kcalloc()]
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c9d2d5c 29-Nov-2017 Dan Williams <dan.j.williams@intel.com>

mm: replace pte_write with pte_access_permitted in fault + gup paths

The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:

- validate that the mapping is writable from a protection keys
standpoint

- validate that the pte has _PAGE_USER set since all fault paths where
pte_write is must be referencing user-memory.

Link: http://lkml.kernel.org/r/151043111604.2842.8051684481794973100.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b65c467 08-Sep-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, x86/mm: Fix performance regression in get_user_pages_fast()

The 0-day test bot found a performance regression that was tracked down to
switching x86 to the generic get_user_pages_fast() implementation:

http://lkml.kernel.org/r/20170710024020.GA26389@yexl-desktop

The regression was caused by the fact that we now use local_irq_save() +
local_irq_restore() in get_user_pages_fast() to disable interrupts.
In x86 implementation local_irq_disable() + local_irq_enable() was used.

The fix is to make get_user_pages_fast() use local_irq_disable(),
leaving local_irq_save() for __get_user_pages_fast() that can be called
with interrupts disabled.

Numbers for pinning a gigabyte of memory, one page a time, 20 repeats:

Before: Average: 14.91 ms, stddev: 0.45 ms
After: Average: 10.76 ms, stddev: 0.18 ms

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: linux-mm@kvack.org
Fixes: e585513b76f7 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
Link: http://lkml.kernel.org/r/20170908215603.9189-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# df6ad698 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/device-public-memory: device memory cache coherent with CPU

Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion. Add a new type of
ZONE_DEVICE to represent such memory. The use case are the same as for
the un-addressable device memory but without all the corners cases.

Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84c3fc4e 08-Sep-2017 Zi Yan <zi.yan@cs.rutgers.edu>

mm: thp: check pmd migration entry in common path

When THP migration is being used, memory management code needs to handle
pmd migration entries properly. This patch uses !pmd_present() or
is_swap_pmd() (depending on whether pmd_none() needs separate code or
not) to check pmd migration entries at the places where a pmd entry is
present.

Since pmd-related code uses split_huge_page(), split_huge_pmd(),
pmd_trans_huge(), pmd_trans_unstable(), or
pmd_none_or_trans_huge_or_clear_bad(), this patch:

1. adds pmd migration entry split code in split_huge_pmd(),

2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
them.

Until this commit, a pmd entry should be:
1. pointing to a pte page,
2. is_swap_pmd(),
3. pmd_trans_huge(),
4. pmd_devmap(), or
5. pmd_none().

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09180ca4 06-Sep-2017 Oliver O'Halloran <oohall@gmail.com>

mm/gup: make __gup_device_* require THP

These functions are the only bits of generic code that use
{pud,pmd}_pfn() without checking for CONFIG_TRANSPARENT_HUGEPAGE. This
works fine on x86, the only arch with devmap support, since the *_pfn()
functions are always defined there, but this isn't true for every
architecture.

Link: http://lkml.kernel.org/r/20170626063833.11094-1-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d63206ee 06-Jul-2017 Punit Agrawal <punitagrawal@gmail.com>

mm, gup: ensure real head page is ref-counted when using hugepages

When speculatively taking references to a hugepage using
page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
page returned by pmd_page() is the head page. Although normally true,
this assumption doesn't hold when the hugepage comprises of successive
page table entries such as when using contiguous bit on arm64 at PTE or
PMD levels.

This can be addressed by ensuring that the page passed to
page_cache_add_speculative() is the real head or by de-referencing the
head page within the function.

We take the first approach to keep the usage pattern aligned with
page_cache_get_speculative() where users already pass the appropriate
page, i.e., the de-referenced head.

Apply the same logic to fix gup_huge_[pud|pgd]() as well.

[punit.agrawal@arm.com: fix arm64 ltp failure]
Link: http://lkml.kernel.org/r/20170619170145.25577-5-punit.agrawal@arm.com
Link: http://lkml.kernel.org/r/20170522133604.11392-3-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3e32855 06-Jul-2017 Will Deacon <will@kernel.org>

mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages

When operating on hugepages with DEBUG_VM enabled, the GUP code checks
the compound head for each tail page prior to calling
page_cache_add_speculative. This is broken, because on the fast-GUP
path (where we don't hold any page table locks) we can be racing with a
concurrent invocation of split_huge_page_to_list.

split_huge_page_to_list deals with this race by using page_ref_freeze to
freeze the page and force concurrent GUPs to fail whilst the component
pages are modified. This modification includes clearing the
compound_head field for the tail pages, so checking this prior to a
successful call to page_cache_add_speculative can lead to false
positives: In fact, page_cache_add_speculative *already* has this check
once the page refcount has been successfully updated, so we can simply
remove the broken calls to VM_BUG_ON_PAGE.

Link: http://lkml.kernel.org/r/20170522133604.11392-2-punit.agrawal@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4dc71451 06-Jul-2017 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/follow_page_mask: add support for hugepage directory entry

Architectures like ppc64 supports hugepage size that is not mapped to
any of of the page table levels. Instead they add an alternate page
table entry format called hugepage directory (hugepd). hugepd indicates
that the page table entry maps to a set of hugetlb pages. Add support
for this in generic follow_page_mask code. We already support this
format in the generic gup code.

The default implementation prints warning and returns NULL. We will add
ppc64 support in later patches

Link: http://lkml.kernel.org/r/1494926612-23928-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# faaa5b62 06-Jul-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm/follow_page_mask: add support for hugetlb pgd entries

ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd
entries to follow_page_mask so that ppc64 can switch to it to handle
hugetlbe entries.

Link: http://lkml.kernel.org/r/1494926612-23928-5-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 080dbb61 06-Jul-2017 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/follow_page_mask: split follow_page_mask to smaller functions.

Makes code reading easy. No functional changes in this patch. In a
followup patch, we will be updating the follow_page_mask to handle
hugetlb hugepd format so that archs like ppc64 can switch to the generic
version. This split helps in doing that nicely.

Link: http://lkml.kernel.org/r/1494926612-23928-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1be7107f 19-Jun-2017 Hugh Dickins <hughd@google.com>

mm: larger stack guard gap, between vmas

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e585513b 06-Jun-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation

This patch provides all required callbacks required by the generic
get_user_pages_fast() code and switches x86 over - and removes
the platform specific implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9a291a7c 02-Jun-2017 James Morse <james.morse@arm.com>

mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified

KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
special case. (check_user_page_hwpoison())

When huge pages are involved, this doesn't work so well.
get_user_pages() calls follow_hugetlb_page(), which stops early if it
receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
-EFAULT to the caller. The step to map this to -EHWPOISON based on the
FOLL_ flags is missing. The hwpoison special case is skipped, and
-EFAULT is returned to user-space, causing Qemu or kvmtool to exit.

Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page() and follow_hugetlb_page().

With this, KVM works as expected.

This isn't a problem for arm64 today as we haven't enabled
MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
too, so I think this should be a fix. This doesn't apply earlier than
stable's v4.11.1 due to all sorts of cleanup.

[james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
suggested.
Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.11.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa2369f1 03-May-2017 Arnd Bergmann <arnd@arndb.de>

mm/gup.c: fix access_ok() argument type

MIPS just got changed to only accept a pointer argument for access_ok(),
causing one warning in drivers/scsi/pmcraid.c. I tried changing x86 the
same way and found the same warning in __get_user_pages_fast() and
nowhere else in the kernel during randconfig testing:

mm/gup.c: In function '__get_user_pages_fast':
mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion]

It would probably be a good idea to enforce type-safety in general, so
let's change this file to not cause a warning if we do that.

I don't know why the warning did not appear on MIPS.

Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6dd29b3d 23-Apr-2017 Ingo Molnar <mingo@kernel.org>

Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"

This reverts commit 2947ba054a4dabbd82848728d765346886050029.

Dan Williams reported dax-pmem kernel warnings with the following signature:

WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic

... and bisected it to this commit, which suggests possible memory corruption
caused by the x86 fast-GUP conversion.

He also pointed out:

"
This is similar to the backtrace when we were not properly handling
pud faults and was fixed with this commit: 220ced1676c4 "mm: fix
get_user_pages() vs device-dax pud mappings"

I've found some missing _devmap checks in the generic
get_user_pages_fast() path, but this does not fix the regression
[...]
"

So given that there are known bugs, and a pretty robust looking bisection
points to this commit suggesting that are unknown bugs in the conversion
as well, revert it for the time being - we'll re-try in v4.13.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dann.frazier@canonical.com
Cc: dave.hansen@intel.com
Cc: steve.capper@linaro.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2947ba05 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation

This patch provides all required callbacks required by the generic
get_user_pages_fast() code and switches x86 over - and removes
the platform specific implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 73e10a61 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Provide callback to check if __GUP_fast() is allowed for the range

This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.

On x86, get_user_pages_fast() does a couple of sanity checks to see if we can
call __get_user_pages_fast() for the range.

This kind of wrapping protection should be useful for the generic code too.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-7-kirill.shutemov@linux.intel.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b59f65fa 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Implement the dev_pagemap() logic in the generic get_user_pages_fast() function

This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.

Prepare generic GUP_fast() to handle dev_pagemap(). At the moment, it's
only implemented on x86. On non-x86, the new code will be compiled out.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e9348053 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Mark all pages PageReferenced in generic get_user_pages_fast()

This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.

Unlike generic GUP_fast(), the x86 version makes all pages it touches
referenced. It seems required for GRU and EPT.

See the following commit:

8ee53820edfd ("thp: mmu_notifier_test_young")

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0005d20b 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Move page table entry dereference into helper function

This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.

On x86 PAE, page table entry is larger than sizeof(long) and we would
need to provide a helper that can read the entry atomically.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e7884f8e 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Move permission checks into helpers

This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.

On x86, we would need to do additional permission checks to determine if
access is allowed.

Let's abstract it out into separate helpers.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9a804fec 16-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/gup: Drop the arch_pte_access_permitted() MMU callback

The only arch that defines it to something meaningful is x86.
But x86 doesn't use the generic GUP_fast() implementation -- the
only place where the callback is called.

Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316152655.37789-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ce70df08 12-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, gup: fix typo in gup_p4d_range()

gup_p4d_range() should call gup_pud_range(), not itself.

[ This was not noticed on x86: this is the HAVE_GENERIC_RCU_GUP code
used by arm[64] and powerpc - Linus ]

Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reported-by: Anton Blanchard <anton@samba.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2febafc 09-Mar-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: convert generic code to 5-level paging

Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical adding handling one more page table level in
places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# db08f203 24-Feb-2017 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/gup: check for protnone only if it is a PTE entry

Do the prot_none/FOLL_NUMA check after we are sure this is a THP pte.
Archs can implement prot_none such that it can return true for regular
pmd entries.

Link: http://lkml.kernel.org/r/1487498326-8734-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a00cc7d9 24-Feb-2017 Matthew Wilcox <willy@infradead.org>

mm, x86: add support for PUD-sized transparent hugepages

The current transparent hugepage code only supports PMDs. This patch
adds support for transparent use of PUDs with DAX. It does not include
support for anonymous pages. x86 support code also added.

Most of this patch simply parallels the work that was done for huge
PMDs. The only major difference is how the new ->pud_entry method in
mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry. The pagewalk code takes care of locking the PUD before
calling ->pud_walk, so handlers do not need to worry whether the PUD is
stable.

[dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[dave.jiang@intel.com: native_pud_clear missing on i386 build]
Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87ffc118 22-Feb-2017 Andrea Arcangeli <aarcange@redhat.com>

userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY

Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
will work on hugetlbfs.

This is required for fully functional userfaultfd non-present support on
hugetlbfs.

Link: http://lkml.kernel.org/r/20161216144821.5183-25-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8b7457ef 14-Dec-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: unexport __get_user_pages_unlocked()

Unexport the low-level __get_user_pages_unlocked() function and replaces
invocations with calls to more appropriate higher-level functions.

In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
with get_user_pages_unlocked() since we can now pass gup_flags.

In async_pf_execute() and process_vm_rw_single_vec() we need to pass
different tsk, mm arguments so get_user_pages_remote() is the sane
replacement in these cases (having added manual acquisition and release
of mmap_sem.)

Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
flag. However, this flag was originally silently dropped by commit
1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
appears to have been unintentional and reintroducing it is therefore not
an issue.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b56d49f 14-Dec-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: add locked parameter to get_user_pages_remote()

Patch series "mm: unexport __get_user_pages_unlocked()".

This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.

It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.

Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)

This patch (of 2):

Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().

[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80a79516 12-Dec-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: fix up get_user_pages* comments

In the previous round of get_user_pages* changes comments attached to
__get_user_pages_unlocked() and get_user_pages_unlocked() were rendered
incorrect, this patch corrects them.

In addition the get_user_pages_unlocked() comment seems to have already
been outdated as it referred to tsk, mm parameters which were removed in
c12d2da5 ("mm/gup: Remove the macro overload API migration helpers from
the get_user*() APIs"), this patch fixes this also.

Link: http://lkml.kernel.org/r/20161025233435.5338-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 771ab430 12-Dec-2016 Tobias Klauser <tklauser@distanz.ch>

mm/gup.c: make unnecessarily global vma_permits_fault() static

Make vma_permits_fault() static as it is only used in mm/gup.c

This fixes a sparse warning.

Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d731759 24-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: unexport __get_user_pages()

This patch unexports the low-level __get_user_pages() function.

Recent refactoring of the get_user_pages* functions allow flags to be
passed through get_user_pages() which eliminates the need for access to
this function from its one user, kvm.

We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:

get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9beae1ea 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: replace get_user_pages_remote() write/force parameters with gup_flags

This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 768ae309 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: replace get_user_pages() write/force parameters with gup_flags

This removes the 'write' and 'force' from get_user_pages() and replaces
them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
as use of this flag can result in surprising behaviour (and hence bugs)
within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b913179 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: replace get_user_pages_locked() write/force parameters with gup_flags

This removes the 'write' and 'force' use from get_user_pages_locked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c164154f 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: replace get_user_pages_unlocked() write/force parameters with gup_flags

This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d4944b0e 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: remove write/force parameters from __get_user_pages_unlocked()

This removes the redundant 'write' and 'force' parameters from
__get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 859110d7 12-Oct-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: remove write/force parameters from __get_user_pages_locked()

This removes the redundant 'write' and 'force' parameters from
__get_user_pages_locked() to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19be0eaf 13-Oct-2016 Linus Torvalds <torvalds@linux-foundation.org>

mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.

Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# baa355fd 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: file pages support for split_huge_page()

Basic scheme is the same as for anon THP.

Main differences:

- File pages are on radix-tree, so we have head->_count offset by
HPAGE_PMD_NR. The count got distributed to small pages during split.

- mapping->tree_lock prevents non-lockless access to pages under split
over radix-tree;

- Lockless access is prevented by setting the head->_count to 0 during
split;

- After split, some pages can be beyond i_size. We drop them from
radix-tree.

- We don't setup migration entries. Just unmap pages. It helps
handling cases when i_size is in the middle of the page: no need
handle unmap pages beyond i_size manually.

Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dcddffd4 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: do not pass mm_struct into handle_mm_fault

We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 337d9abf 26-Jul-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: thp: check pmd_trans_unstable() after split_huge_pmd()

split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
to pte entries, which can be checked with pmd_trans_unstable(). Some
callers make this assertion and some do it differently and some not, so
let's do it in a unified manner.

Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# add6a0cd 07-Jun-2016 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: try to fix up page faults before giving up

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
and fixup_user_fault together help supporting it. The patch also supports
VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
reference counting.

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Neo Jia <cjia@nvidia.com>
Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c12d2da5 04-Apr-2016 Ingo Molnar <mingo@kernel.org>

mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs

The pkeys changes brought about a truly hideous set of macros in:

cde70140fed8 ("mm/gup: Overload get_user_pages() functions")

... which macros are (ab-)using the fact that __VA_ARGS__ can be used
to shift parameter positions in macro arguments without breaking the
build and so can be used to call separate C functions depending on
the number of arguments of the macro.

This allowed easy migration of these 3 GUP APIs, as both these variants
worked at the C level:

old:
ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

new:
ret = get_user_pages(address, 1, 1, 0, &page, NULL);

... while we also generated a (functionally harmless but noticeable) build
time warning if the old API was used. As there are over 300 uses of these
APIs, this trick eased the migration of the API and avoided excessive
migration pain in linux-next.

Now, with its work done, get rid of all of that complication and ugliness:

3 files changed, 16 insertions(+), 140 deletions(-)

... where the linecount of the migration hack was further inflated by the
fact that there are NOMMU variants of these GUP APIs as well.

Much of the conversion was done in linux-next over the past couple of months,
and Linus recently removed all remaining old API uses from the upstream tree
in the following upstrea commit:

cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")

There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
code path that ARM, ARM64 and PowerPC uses.

After this commit any old API usage will break the build.

[ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ea1754a0 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage

Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d61172b4 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/core, x86/mm/pkeys: Differentiate instruction fetches

As discussed earlier, we attempt to enforce protection keys in
software.

However, the code checks all faults to ensure that they are not
violating protection key permissions. It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.

But, there is a third category of faults for protection keys:
instruction faults. Instruction faults never run afoul of
protection keys because they do not affect instruction fetches.

So, plumb the PF_INSTR bit down in to the
arch_vma_access_permitted() function where we do the protection
key checks.

We also add a new FAULT_FLAG_INSTRUCTION. This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information in to the arch-generic fault
flags.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1b2ee126 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/core: Do not enforce PKEY permissions on remote mm access

We try to enforce protection keys in software the same way that we
do in hardware. (See long example below).

But, we only want to do this when accessing our *own* process's
memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
tried to PTRACE_POKE a target process which just happened to have
some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
debugger access to that memory. PKRU is fundamentally a
thread-local structure and we do not want to enforce it on access
to _another_ thread's data.

This gets especially tricky when we have workqueues or other
delayed-work mechanisms that might run in a random process's context.
We can check that we only enforce pkeys when operating on our *own* mm,
but delayed work gets performed when a random user context is active.
We might end up with a situation where a delayed-work gup fails when
running randomly under its "own" task but succeeds when running under
another process. We want to avoid that.

To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
fault flag: FAULT_FLAG_REMOTE. They indicate that we are
walking an mm which is not guranteed to be the same as
current->mm and should not be subject to protection key
enforcement.

Thanks to Jerome Glisse for pointing out this scenario.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Low <jason.low2@hp.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: iommu@lists.linux-foundation.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 33a709b2 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action. For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys. Basically, we
try to make sure that if a user does this:

mprotect(ptr, size, PROT_NONE);
*ptr = foo;

they see the same effects with protection keys when they do this:

mprotect(ptr, size, PROT_READ|PROT_WRITE);
set_pkey(ptr, size, 4);
wrpkru(0xffffff3f); // access disable pkey 4
*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

arch_pte_access_permitted(pte_flags, write)
arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys. When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d4925e00 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/gup: Factor out VMA fault permission checking

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210216.C3824032@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d4edcf0d 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm

We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called. For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)

This patch switches all callers of:

get_user_pages()
get_user_pages_unlocked()
get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cde70140 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/gup: Overload get_user_pages() functions

The concept here was a suggestion from Ingo. The implementation
horrors are all mine.

This allows get_user_pages(), get_user_pages_unlocked(), and
get_user_pages_locked() to be called with or without the
leading tsk/mm arguments. We will give a compile-time warning
about the old style being __deprecated and we will also
WARN_ON() if the non-remote version is used for a remote-style
access.

Doing this, folks will get nice warnings and will not break the
build. This should be nice for -next and will hopefully let
developers fix up their own code instead of maintainers needing
to do it at merge time.

The way we do this is hideous. It uses the __VA_ARGS__ macro
functionality to call different functions based on the number
of arguments passed to the macro.

There's an additional hack to ensure that our EXPORT_SYMBOL()
of the deprecated symbols doesn't trigger a warning.

We should be able to remove this mess as soon as -rc1 hits in
the release after this is merged.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Leon Romanovsky <leon@leon.nu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1e987790 12-Feb-2016 Dave Hansen <dave.hansen@linux.intel.com>

mm/gup: Introduce get_user_pages_remote()

For protection keys, we need to understand whether protections
should be enforced in software or not. In general, we enforce
protections when working on our own task, but not when on others.
We call these "current" and "remote" operations.

This patch introduces a new get_user_pages() variant:

get_user_pages_remote()

Which is a replacement for when get_user_pages() is called on
non-current tsk/mm.

We also introduce a new gup flag: FOLL_REMOTE which can be used
for the "__" gup variants to get this new behavior.

The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address. This
makes it a pretty unique gup caller. Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.

Without protection keys, this patch should not change any behavior.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 46435364 30-Jan-2016 Hugh Dickins <hughd@google.com>

mm: retire GUP WARN_ON_ONCE that outlived its usefulness

Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
areas"). The warning has served its purpose, nobody was harmed by that
change, so just remove the warning to generate less noise from Trinity.

Which reminds me of the comment I wrongly left behind with that commit
(but was spotted at the time by Kirill), which has since moved into a
separate function, and become even more obscure: delete it.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a9e1cda 15-Jan-2016 Dominik Dingel <dingel@linux.vnet.ibm.com>

mm: bring in additional flag for fixup_user_fault to signal unlock

During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.

The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault. Till now we never cared about retries, but as the
userfaultfd code kind of relies on it. this needs some fix.

This patchset does not take care of the futex code. I will now look
closer at this.

This patch (of 2):

With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.

This patch brings in the logic to handle retries as well as it cleans up
the current documentation. fixup_user_fault was not having the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3565fce3 15-Jan-2016 Dan Williams <dan.j.williams@intel.com>

mm, x86: get_user_pages() for dax mappings

A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e90309c9 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: allow mlocked THP again

Before THP refcounting rework, THP was not allowed to cross VMA
boundary. So, if we have THP and we split it, PG_mlocked can be safely
transferred to small pages.

With new THP refcounting and naive approach to mlocking we can end up
with this scenario:
1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
2. the process does munlock() on the *part* of the THP:
- the VMA is split into two, one of them VM_LOCKED;
- huge PMD split into PTE table;
- THP is still mlocked;
3. split_huge_page():
- it transfers PG_mlocked to *all* small pages regrardless if it
blong to any VM_LOCKED VMA.

We probably could munlock() all small pages on split_huge_page(), but I
think we have accounting issue already on step two.

Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().

This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan. After the split vmscan will detect
unevictable small pages and mlock them.

With this approach we shouldn't hit situation like described above.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b471e88 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, thp: remove infrastructure for handling splitting PMDs

With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddc58f27 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: drop tail page refcounting

Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times this page is
pinned. get_page() bumps ->_mapcount on tail page in addition to
->_count on head. This information is required by split_huge_page() to
be able to distribute pins from head of compound page to tails during
the split.

We will need ->_mapcount to account PTE mappings of subpages of the
compound page. We eliminate need in current meaning of ->_mapcount in
tail pages by forbidding split entirely if the page is pinned.

The only user of tail page refcounting is THP which is marked BROKEN for
now.

Let's drop all this mess. It makes get_page() and put_page() much
simpler.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78ddc534 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: rename split_huge_page_pmd() to split_huge_pmd()

We are going to decouple splitting THP PMD from splitting underlying
compound page.

This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
to reflect the fact that it doesn't imply page splitting, only PMD.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7479df6d 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp, mlock: do not allow huge pages in mlocked area

With new refcounting THP can belong to several VMAs. This makes tricky
to track THP pages, when they partially mlocked. It can lead to leaking
mlocked pages to non-VM_LOCKED vmas and other problems.

With this patch we will split all pages on mlock and avoid
fault-in/collapse new THP in VM_LOCKED vmas.

I've tried alternative approach: do not mark THP pages mlocked and keep
them on normal LRUs. This way vmscan could try to split huge pages on
memory pressure and free up subpages which doesn't belong to VM_LOCKED
vmas. But this is user-visible change: we screw up Mlocked accouting
reported in meminfo, so I had to leave this approach aside.

We can bring something better later, but this should be good enough for
now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7aef4172 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton

With new refcounting we are going to see THP tail pages mapped with PTE.
Generic fast GUP rely on page_cache_get_speculative() to obtain
reference on page. page_cache_get_speculative() always fails on tail
pages, because ->_count on tail pages is always zero.

Let's handle tail pages in gup_pte_range().

New split_huge_page() will rely on migration entries to freeze page's
counts. Recheck PTE value after page_cache_get_speculative() on head
page should be enough to serialize against split.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6742d293 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: adjust FOLL_SPLIT for new refcounting

We need to prepare kernel to allow transhuge pages to be mapped with
ptes too. We need to handle FOLL_SPLIT in follow_page_pte().

Also we use split_huge_page() directly instead of split_huge_page_pmd().
split_huge_page_pmd() will gone.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de60f5f1 05-Nov-2015 Eric B Munson <emunson@akamai.com>

mm: introduce VM_LOCKONFAULT

The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be used
this can incur a high penalty for locking.

For the example of a large file, this is the usage pattern for a large
statical language model (probably applies to other statical or graphical
models as well). For the security example, any application transacting in
data that cannot be swapped out (credit card data, medical records, etc).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
flag for a VMA will cause pages faulted into that VMA to be added to the
unevictable LRU when they are faulted or if they are already present, but
will not cause any missing pages to be faulted in.

Exposing this new lock state means that we cannot overload the meaning of
the FOLL_POPULATE flag any longer. Prior to this patch it was used to
mean that the VMA for a fault was locked. This means we need the new
FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
will now only control if the VMA should be populated and in the case of
VM_LOCKONFAULT, it will not be set.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1027e443 04-Sep-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: make GUP handle pfn mapping unless FOLL_GET is requested

With DAX, pfn mapping becoming more common. The patch adjusts GUP code to
cover pfn mapping for cases when we don't need struct page to proceed.

To make it possible, let's change follow_page() code to return -EEXIST
error code if proper page table entry exists, but no corresponding struct
page. __get_user_page() would ignore the error code and move to the next
page frame.

The immediate effect of the change is working MAP_POPULATE and mlock() on
DAX mappings.

[akpm@linux-foundation.org: fix arm64 build]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d8c47e4 15-Apr-2015 Jason Low <jason.low2@hp.com>

mm: use READ_ONCE() for non-scalar types

Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
ACCESS_ONCE doesn't work reliably on non-scalar types.

This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
and __get_user_pages_fast() in mm/gup.c

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acc3c8d1 14-Apr-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: move mm_populate()-related code to mm/gup.c

It's odd that we have populate_vma_page_range() and __mm_populate() in
mm/mlock.c. It's implementation of generic memory population and mlocking
is one of possible side effect, if VM_LOCKED is set.

__get_user_pages() is core of the implementation. Let's move the code
into mm/gup.c.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84d33df2 14-Apr-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: rename FOLL_MLOCK to FOLL_POPULATE

After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
get_user_pages only for mlock") FOLL_MLOCK has lost its original
meaning: we don't necessarily mlock the page if the flags is set -- we
also take VM_LOCKED into consideration.

Since we use the same codepath for __mm_populate(), let's rename
FOLL_MLOCK to FOLL_POPULATE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a0516ed 12-Feb-2015 Mel Gorman <mgorman@suse.de>

mm: convert p[te|md]_numa users to p[te|md]_protnone_numa

Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other page
table modifiers are also altered. This patch layout is to make review
easier.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7b78075 11-Feb-2015 Andrea Arcangeli <aarcange@redhat.com>

mm: gup: use get_user_pages_unlocked within get_user_pages_fast

This allows the get_user_pages_fast slow path to release the mmap_sem
before blocking.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0fd71a56 11-Feb-2015 Andrea Arcangeli <aarcange@redhat.com>

mm: gup: add __get_user_pages_unlocked to customize gup_flags

Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
to get a proper -EHWPOSION retval instead of -EFAULT to take a more
appropriate action if get_user_pages runs into a memory failure.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0818f47 11-Feb-2015 Andrea Arcangeli <aarcange@redhat.com>

mm: gup: add get_user_pages_locked and get_user_pages_unlocked

FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.

Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).

So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.

The only few places that remains uncovered are drivers like v4l and other
exceptions that tends to work on their own memory and they're not working
on random user memory (for example like O_DIRECT that uses gup_fast and is
fully covered by this patch).

A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually. The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).

While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .

The userfaultfd allows to block the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is waken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.

This patch (of 5):

We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call. Example from:

down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

to:

int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

into:

get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e66f17ff 11-Feb-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/hugetlb: take page table lock in follow_huge_pmd()

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.

This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.

This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.

We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.

Here is the reproducer:

$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000

int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;

pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);

while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");

for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}

$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000

int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;

while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}

$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0661a336 10-Feb-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: remove rest usage of VM_NONLINEAR and pte_file()

One bit in ->vm_flags is unused now!

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33692f27 29-Jan-2015 Linus Torvalds <torvalds@linux-foundation.org>

vm: add VM_FAULT_SIGSEGV handling support

The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.

Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 38c5ce93 06-Jan-2015 Christian Borntraeger <borntraeger@de.ibm.com>

mm/gup: Replace ACCESS_ONCE with READ_ONCE

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Fixup gup_pmd_range.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# e37c6982 07-Dec-2014 Christian Borntraeger <borntraeger@de.ibm.com>

mm: replace ACCESS_ONCE with READ_ONCE or barriers

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Let's change the code to access the page table elements with
READ_ONCE that does implicit scalar accesses for the gup code.

mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t
as array of longs. This code requires just that the pmd_present
and pmd_trans_huge check are done on the same value, so a barrier
is sufficent.

A similar case is in handle_pte_fault. On ppc44x the word size is
32 bit, but a pte is 64 bit. A barrier is ok as well.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: linux-mm@kvack.org
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f30c59e9 05-Nov-2014 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm: Update generic gup implementation to handle hugepage directory

Update generic gup implementation with powerpc specific details.
On powerpc at pmd level we can have hugepte, normal pmd pointer
or a pointer to the hugepage directory.

Tested-by: Steve Capper <steve.capper@linaro.org>
Acked-by: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2667f50e 09-Oct-2014 Steve Capper <steve.capper@linaro.org>

mm: introduce a general RCU get_user_pages_fast()

This series implements general forms of get_user_pages_fast and
__get_user_pages_fast in core code and activates them for arm and arm64.

These are required for Transparent HugePages to function correctly, as a
futex on a THP tail will otherwise result in an infinite loop (due to the
core implementation of __get_user_pages_fast always returning 0).

Unfortunately, a futex on THP tail can be quite common for certain
workloads; thus THP is unreliable without a __get_user_pages_fast
implementation.

This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.

This patch (of 6):

get_user_pages_fast() attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs to
block any THP splits.

One way to achieve this is to have the walker disable interrupts, and rely
on IPIs from the TLB flushing code blocking before the page table pages
are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic has
been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
own get_user_pages_fast routines.

The RCU page table free logic coupled with an IPI broadcast on THP split
(which is a rare event), allows one to protect a page table walker by
merely disabling the interrupts during the walk.

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of TLB
invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.

[akpm@linux-foundation.org: various comment fixes]
Signed-off-by: Steve Capper <steve.capper@linaro.org>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 234b239b 17-Sep-2014 Andres Lagar-Cavilla <andreslc@google.com>

kvm: Faults which trigger IO release the mmap_sem

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9a95f3cf 06-Aug-2014 Paul Cassella <cassella@cray.com>

mm: describe mmap_sem rules for __lock_page_or_retry() and callers

Add a comment describing the circumstances in which
__lock_page_or_retry() will or will not release the mmap_sem when
returning 0.

Add comments to lock_page_or_retry()'s callers (filemap_fault(),
do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

Add comments on up the call tree, particularly replacing the false "We
return with mmap_sem still held" comments.

Signed-off-by: Paul Cassella <cassella@cray.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa5bb209 04-Jun-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: cleanup __get_user_pages()

Get rid of two nested loops over nr_pages, extract vma flags checking to
separate function and other random cleanups.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16744483 04-Jun-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: extract code to fault in a page from __get_user_pages()

Nesting level in __get_user_pages() is just insane. Let's try to fix it
a bit.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69e68b4f 04-Jun-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: cleanup follow_page_mask()

Cleanups:
- move pte-related code to separate function. It's about half of the
function;
- get rid of some goto-logic;
- use 'return NULL' instead of 'return page' where page can only be
NULL;

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2b495ca 04-Jun-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: extract in_gate_area() case from __get_user_pages()

The case is special and disturb from reading main __get_user_pages()
code path. Let's move it to separate function.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4bbd4c77 04-Jun-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: move get_user_pages()-related code to separate file

mm/memory.c is overloaded: over 4k lines. get_user_pages() code is
pretty much self-contained let's move it to separate file.

No other changes made.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>