History log of /linux-master/mm/migrate.c
Revision Date Author Comments
# 055267fe 06-Feb-2024 Paul Gofman <pgofman@codeweavers.com>

mm/migrate: preserve exact soft-dirty state

pte_mkdirty() sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY bits. The
_PAGE_SOFT_DIRTY can get set even if it wasn't set on original page before
migration. This makes non-soft-dirty pages soft-dirty just because of
migration/compaction. Clear the _PAGE_SOFT_DIRTY flag if it wasn't set on
original page.

By definition of soft-dirty feature, there can be spurious soft-dirty
pages because of kernel's internal activity such as VMA merging or
migration/compaction. This patch is eliminating the spurious soft-dirty
pages because of migration/compaction.

Link: https://lkml.kernel.org/r/20240206084838.34560-1-usama.anjum@collabora.com
Signed-off-by: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Cc: Michał Mirosław <emmir@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2774f256 16-Feb-2024 Byungchul Park <byungchul@sk.com>

mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index

With numa balancing on, when a numa system is running where a numa node
doesn't have its local memory so it has no managed zones, the following
oops has been observed. It's because wakeup_kswapd() is called with a
wrong zone index, -1. Fixed it by checking the index before calling
wakeup_kswapd().

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> </TASK>

Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reported-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0003e2a4 27-Oct-2023 Sean Christopherson <seanjc@google.com>

mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance. KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running. To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a15dc478 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm/migrate: page_add_anon_rmap() -> folio_add_anon_rmap_pte()

Let's convert remove_migration_pte().

Link: https://lkml.kernel.org/r/20231220224504.646757-18-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c4dffb0b 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm/migrate: page_add_file_rmap() -> folio_add_file_rmap_pte()

Let's convert remove_migration_pte().

Link: https://lkml.kernel.org/r/20231220224504.646757-11-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 44887f39 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm/rmap: introduce and use hugetlb_add_file_rmap()

hugetlb rmap handling differs quite a lot from "ordinary" rmap code. For
example, hugetlb currently only supports entire mappings, and treats any
mapping as mapped using a single "logical PTE". Let's move it out of the
way so we can overhaul our "ordinary" rmap. implementation/interface.

Right now we're using page_dup_file_rmap() in some cases where "ordinary"
rmap code would have used page_add_file_rmap(). So let's introduce and
use hugetlb_add_file_rmap() instead. We won't be adding a
"hugetlb_dup_file_rmap()" functon for the fork() case, as it would be
doing the same: "dup" is just an optimization for "add".

What remains is a single page_dup_file_rmap() call in fork() code.

Add sanity checks that we end up with the right folios in the right
functions.

Link: https://lkml.kernel.org/r/20231220224504.646757-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9d5fafd5 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm/rmap: rename hugepage_add* to hugetlb_add*

Patch series "mm/rmap: interface overhaul", v2.

This series overhauls the rmap interface, to get rid of the "bool
compound" / RMAP_COMPOUND parameter with the goal of making the interface
less error prone, more future proof, and more natural to extend to
"batching". Also, this converts the interface to always consume
folio+subpage, which speeds up operations on large folios.

Further, this series adds PTE-batching variants for 4 rmap functions,
whereby only folio_add_anon_rmap_ptes() is used for batching in this
series when PTE-remapping a PMD-mapped THP. folio_remove_rmap_ptes(),
folio_try_dup_anon_rmap_ptes() and folio_dup_file_rmap_ptes() will soon
come in handy[1,2].

This series performs a lot of folio conversion along the way. Most of the
added LOC in the diff are only due to documentation.

As we're moving to a pte/pmd interface where we clearly express the
mapping granularity we are dealing with, we first get the remainder of
hugetlb out of the way, as it is special and expected to remain special:
it treats everything as a "single logical PTE" and only currently allows
entire mappings.

Even if we'd ever support partial mappings, I strongly assume the
interface and implementation will still differ heavily: hopefull we can
avoid working on subpages/subpage mapcounts completely and only add a
"count" parameter for them to enable batching.

New (extended) hugetlb interface that operates on entire folio:
* hugetlb_add_new_anon_rmap() -> Already existed
* hugetlb_add_anon_rmap() -> Already existed
* hugetlb_try_dup_anon_rmap()
* hugetlb_try_share_anon_rmap()
* hugetlb_add_file_rmap()
* hugetlb_remove_rmap()

New "ordinary" interface for small folios / THP::
* folio_add_new_anon_rmap() -> Already existed
* folio_add_anon_rmap_[pte|ptes|pmd]()
* folio_try_dup_anon_rmap_[pte|ptes|pmd]()
* folio_try_share_anon_rmap_[pte|pmd]()
* folio_add_file_rmap_[pte|ptes|pmd]()
* folio_dup_file_rmap_[pte|ptes|pmd]()
* folio_remove_rmap_[pte|ptes|pmd]()

folio_add_new_anon_rmap() will always map at the largest granularity
possible (currently, a single PMD to cover a PMD-sized THP). Could be
extended if ever required.

In the future, we might want "_pud" variants and eventually "_pmds"
variants for batching.

I ran some simple microbenchmarks on an Intel(R) Xeon(R) Silver 4210R:
measuring munmap(), fork(), cow, MADV_DONTNEED on each PTE ... and PTE
remapping PMD-mapped THPs on 1 GiB of memory.

For small folios, there is barely a change (< 1% improvement for me).

For PTE-mapped THP:
* PTE-remapping a PMD-mapped THP is more than 10% faster.
* fork() is more than 4% faster.
* MADV_DONTNEED is 2% faster
* COW when writing only a single byte on a COW-shared PTE is 1% faster
* munmap() barely changes (< 1%).

[1] https://lkml.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com
[2] https://lkml.kernel.org/r/20231204105440.61448-1-ryan.roberts@arm.com


This patch (of 40):

Let's just call it "hugetlb_".

Yes, it's all already inconsistent and confusing because we have a lot of
"hugepage_" functions for legacy reasons. But "hugetlb" cannot possibly
be confused with transparent huge pages, and it matches "hugetlb.c" and
"folio_test_hugetlb()". So let's minimize confusion in rmap code.

Link: https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
Link: https://lkml.kernel.org/r/20231220224504.646757-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d1adb25d 15-Dec-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: fix getting incorrect page mapping during page migration

When running stress-ng testing, we found below kernel crash after a few hours:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......

After analyzing the vmcore, I found this issue is caused by page migration.
The scenario is that, one thread is doing page migration, and we will use the
target page's ->mapping field to save 'anon_vma' pointer between page unmap and
page move, and now the target page is locked and refcount is 1.

Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers that
the refcount of this target page is 1, preventing the offline operation, thus
proceeding to dump the page. However, page_mapping() of the target page may
return an incorrect file mapping to crash the system in dump_mapping(), since
the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.

There are seveval ways to fix this issue:
(1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving
'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target
page has not built mappings yet.
(2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing
the system, however, there are still some PFN walkers that call page_mapping()
without holding the page lock, such as compaction.
(3) Using target page->private field to save the 'anon_vma' pointer and 2 bits
page state, just as page->mapping records an anonymous page, which can remove
the page_mapping() impact for PFN walkers and also seems a simple way.

So I choose option 3 to fix this issue, and this can also fix other potential
issues for PFN walkers, such as compaction.

Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com
Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Xu Yu <xuyu@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 600f111e 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Rename mapping private members

It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# fc346d0a 13-Dec-2023 Charan Teja Kalla <quic_charante@quicinc.com>

mm: migrate high-order folios in swap cache correctly

Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache. However, if a large folio
is re-added to the LRU list, it can be migrated. The migration code was
not aware of the difference between the swap cache and the page cache and
assumed that a single xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio in
the swap cache, which can lead to almost arbitrary data corruption in the
future. This can also manifest as infinite loops with the RCU read lock
held.

[willy@infradead.org: modifications to the changelog & tweaked the fix]
Fixes: 3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()")
Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# eebb3dab 20-Oct-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: record the mlocked page status to remove unnecessary lru drain

When doing compaction, I found the lru_add_drain() is an obvious hotspot
when migrating pages. The distribution of this hotspot is as follows:
- 18.75% compact_zone
- 17.39% migrate_pages
- 13.79% migrate_pages_batch
- 11.66% migrate_folio_move
- 7.02% lru_add_drain
+ 7.02% lru_add_drain_cpu
+ 3.00% move_to_new_folio
1.23% rmap_walk
+ 1.92% migrate_folio_unmap
+ 3.20% migrate_pages_sync
+ 0.90% isolate_migratepages

The lru_add_drain() was added by commit c3096e6782b7 ("mm/migrate:
__unmap_and_move() push good newpage to LRU") to drain the newpage to LRU
immediately, to help to build up the correct newpage->mlock_count in
remove_migration_ptes() for mlocked pages. However, if there are no
mlocked pages are migrating, then we can avoid this lru drain operation,
especailly for the heavy concurrent scenarios.

So we can record the source pages' mlocked status in
migrate_folio_unmap(), and only drain the lru list when the mlocked status
is set in migrate_folio_move().

In addition, the page was already isolated from lru when migrating, so
checking the mlocked status is stable by folio_test_mlocked() in
migrate_folio_unmap().

After this patch, I can see the hotpot of the lru_add_drain() is gone:
- 9.41% migrate_pages_batch
- 6.15% migrate_folio_move
- 3.64% move_to_new_folio
+ 1.80% migrate_folio_extra
+ 1.70% buffer_migrate_folio
+ 1.41% rmap_walk
+ 0.62% folio_add_lru
+ 3.07% migrate_folio_unmap

Meanwhile, the compaction latency shows some improvements when running
thpscale:
base patched
Amean fault-both-1 1131.22 ( 0.00%) 1112.55 * 1.65%*
Amean fault-both-3 2489.75 ( 0.00%) 2324.15 * 6.65%*
Amean fault-both-5 3257.37 ( 0.00%) 3183.18 * 2.28%*
Amean fault-both-7 4257.99 ( 0.00%) 4079.04 * 4.20%*
Amean fault-both-12 6614.02 ( 0.00%) 6075.60 * 8.14%*
Amean fault-both-18 10607.78 ( 0.00%) 8978.86 * 15.36%*
Amean fault-both-24 14911.65 ( 0.00%) 11619.55 * 22.08%*
Amean fault-both-30 14954.67 ( 0.00%) 14925.66 * 0.19%*
Amean fault-both-32 16654.87 ( 0.00%) 15580.31 * 6.45%*

Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 49cac03a 16-Oct-2023 Zi Yan <ziy@nvidia.com>

mm/migrate: add nr_split to trace_mm_migrate_pages stats.

Add nr_split to trace_mm_migrate_pages for large folio (including THP)
split events.

[akpm@linux-foundation.org: cleanup per Huang, Ying]
Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a259945e 16-Oct-2023 Zi Yan <ziy@nvidia.com>

mm/migrate: correct nr_failed in migrate_pages_sync()

nr_failed was missing the large folio splits from migrate_pages_batch()
and can cause a mismatch between migrate_pages() return value and the
number of not migrated pages, i.e., when the return value of
migrate_pages() is 0, there are still pages left in the from page list.
It will happen when a non-PMD THP large folio fails to migrate due to
-ENOMEM and is split successfully but not all the split pages are not
migrated, migrate_pages_batch() would return non-zero, but
astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller
of migrate_pages(), but the not migrated pages are left in the from page
list without being added back to LRU lists.

Fix it by adding a new nr_split counter for large folio splits and adding
it to nr_failed in migrate_page_sync() after migrate_pages_batch() is
done.

Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4e694fe4 18-Oct-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags()

Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also
directly use folio_nid() instead of page_to_nid(&folio->page).

Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8cba9576 06-Oct-2023 Nhat Pham <nphamcs@gmail.com>

hugetlb: memcg: account hugetlb-backed memory in memory controller

Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).

Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ec47e250 03-Oct-2023 Gregory Price <gourry.memverge@gmail.com>

mm/migrate: remove unused mm argument from do_move_pages_to_node

This function does not actively use the mm_struct, it can be removed.

Link: https://lkml.kernel.org/r/20231003144857.752952-2-gregory.price@memverge.com
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a08c7193 26-Sep-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fa1df3f6 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: remove isolated variable in add_page_for_migration()

Directly check the return of isolate_hugetlb() and folio_isolate_lru() to
remove isolated variable, also setup err = -EBUSY in advance before
isolation, and update err only when successfully queued for migration,
which could help us to unify and simplify code a bit.

Link: https://lkml.kernel.org/r/20230913095131.2426871-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b426ed78 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration()

There is some different between hugeTLB and THP behave when passed the
address of a tail page, for THP, it will migrate the entire THP page, but
for HugeTLB, it will return -EACCES, or -ENOENT before commit e66f17ff7177
("mm/hugetlb: take page table lock in follow_huge_pmd()"),

-EACCES The page is mapped by multiple processes and can be moved
only if MPOL_MF_MOVE_ALL is specified.
-ENOENT The page is not present.

But when check manual[1], both of the two errnos are not suitable, it is
better to keep the same behave between hugetlb and THP when passed the
address of a tail page, so let's just remove the PageHead() check for
HugeTLB.

[1] https://man7.org/linux/man-pages/man2/move_pages.2.html

Link: https://lkml.kernel.org/r/20230913095131.2426871-8-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d64cfccb 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: use a folio in add_page_for_migration()

Use a folio in add_page_for_migration() to save compound_head() calls.

Link: https://lkml.kernel.org/r/20230913095131.2426871-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7e2a5e5a 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: use __folio_test_movable()

Use __folio_test_movable(), no need to convert from folio to page again.

Link: https://lkml.kernel.org/r/20230913095131.2426871-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 73eab3ca 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()

At present, numa balance only support base page and PMD-mapped THP, but we
will expand to support to migrate large folio/pte-mapped THP in the
future, it is better to make migrate_misplaced_page() to take a folio
instead of a page, and rename it to migrate_misplaced_folio(), it is a
preparation, also this remove several compound_head() calls.

Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2ac9e99f 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio()

Rename numamigrate_isolate_page() to numamigrate_isolate_folio(), then
make it takes a folio and use folio API to save compound_head() calls.

Link: https://lkml.kernel.org/r/20230913095131.2426871-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 728be28f 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: remove THP mapcount check in numamigrate_isolate_page()

The check of THP mapped by multiple processes was introduced by commit
04fa5d6a6547 ("mm: migrate: check page_count of THP before migrating") and
refactor by commit 340ef3902cf2 ("mm: numa: cleanup flow of transhuge page
migration"), which is out of date, since migrate_misplaced_page() is now
using the standard migrate_pages() for small pages and THPs, the reference
count checking is in folio_migrate_mapping(), so let's remove the special
check for THP.

Link: https://lkml.kernel.org/r/20230913095131.2426871-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a8ac4a76 13-Sep-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: migrate: remove PageTransHuge check in numamigrate_isolate_page()

Patch series "mm: migrate: more folio conversion and unification", v3.

Convert more migrate functions to use a folio, it is also a preparation
for large folio migration support when balancing numa.


This patch (of 8):

The assert VM_BUG_ON_PAGE(order && !PageTransHuge(page), page) is not very
useful,

1) for a tail/base page, order = 0, for a head page, the order > 0 &&
PageTransHuge() is true
2) there is a PageCompound() check and only base page is handled in
do_numa_page(), and do_huge_pmd_numa_page() only handle PMD-mapped
THP
3) even though the page is a tail page, isolate_lru_page() will post
a warning, and fail to isolate the page
4) if large folio/pte-mapped THP migration supported in the future,
we could migrate the entire folio if numa fault on a tail page

so just remove the check.

Link: https://lkml.kernel.org/r/20230913095131.2426871-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20230913095131.2426871-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 09c55050 13-Sep-2023 David Hildenbrand <david@redhat.com>

mm/rmap: pass folio to hugepage_add_anon_rmap()

Let's pass a folio; we are always mapping the entire thing.

Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 229e2253 03-Oct-2023 Gregory Price <gourry.memverge@gmail.com>

mm/migrate: fix do_pages_move for compat pointers

do_pages_move does not handle compat pointers for the page list.
correctly. Add in_compat_syscall check and appropriate get_user fetch
when iterating the page list.

It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
work the same way as the native 32-bit syscall again, restoring the
behavior before my broken commit 5b1b561ba73c ("mm: simplify
compat_sys_move_pages").

More specifically, my patch moved the parsing of the 'pages' array from
the main entry point into do_pages_stat(), which left the syscall
working correctly for the 'stat' operation (nodes = NULL), while the
'move' operation (nodes != NULL) is now missing the conversion and
interprets 'pages' as an array of 64-bit pointers instead of the
intended 32-bit userspace pointers.

It is possible that nobody noticed this bug because the few
applications that actually call move_pages are unlikely to run in
compat mode because of their large memory requirements, but this
clearly fixes a user-visible regression and should have been caught by
ltp.

Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com
Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages")
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 935d4f0c 21-Sep-2023 Ryan Roberts <ryan.roberts@arm.com>

mm: hugetlb: add huge page size param to set_huge_pte_at()

Patch series "Fix set_huge_pte_at() panic on arm64", v2.

This series fixes a bug in arm64's implementation of set_huge_pte_at(),
which can result in an unprivileged user causing a kernel panic. The
problem was triggered when running the new uffd poison mm selftest for
HUGETLB memory. This test (and the uffd poison feature) was merged for
v6.5-rc7.

Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
(correctly this time) to get it backported to v6.5, where the issue first
showed up.


Description of Bug
==================

arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries.
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written.
It previously did this by grabbing the folio out of the pte and querying
its size.

However, there are cases when the pte being set is actually a swap entry.
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries. And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.

But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN. And this causes the code to go
bang. The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7. Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.

If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
it will dereference a bad pointer in page_folio():

static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}


Fix
===

The simplest fix would have been to revert the dodgy cleanup commit
18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
things have moved on, this would have required an audit of all the new
set_huge_pte_at() call sites to see if they should be converted to
set_huge_swap_pte_at(). As per the original intent of the change, it
would also leave us open to future bugs when people invariably get it
wrong and call the wrong helper.

So instead, I've added a huge page size parameter to set_huge_pte_at().
This means that the arm64 code has the size in all cases. It's a bigger
change, due to needing to touch the arches that implement the function,
but it is entirely mechanical, so in my view, low risk.

I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
s390, sparc (and additionally x86_64). I've additionally booted and run
mm selftests against arm64, where I observe the uffd poison test is fixed,
and there are no other regressions.


This patch (of 2):

In order to fix a bug, arm64 needs to be told the size of the huge page
for which the pte is being set in set_huge_pte_at(). Provide for this by
adding an `unsigned long sz` parameter to the function. This follows the
same pattern as huge_pte_clear().

This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, parisc, powerpc,
riscv, s390, sparc). The actual arm64 bug will be fixed in a separate
commit.

No behavioral changes intended.

Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx]
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change]
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 161e393c 12-Jun-2023 Rick Edgecombe <rick.p.edgecombe@intel.com>

mm: Make pte_mkwrite() take a VMA

The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which requires
some core mm changes to function properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable or shadow stack mappings.

But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.

So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.

Previous work pte_mkwrite() renamed pte_mkwrite_novma() and converted
callers that don't have a VMA were to use pte_mkwrite_novma(). So now
change pte_mkwrite() to take a VMA and change the remaining callers to
pass a VMA. Apply the same changes for pmd_mkwrite().

No functional change.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com


# d5db4f9d 12-Jul-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

migrate: use folio_set_bh() instead of set_bh_page()

This function was converted before folio_set_bh() existed. Catch up to
the new API.

Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0201ebf2 28-Jun-2023 David Howells <dhowells@redhat.com>

mm: merge folio_has_private()/filemap_release_folio() call pairs

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.


This patch (of 2):

Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vscan.c that this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 925c86a1 01-Aug-2023 Christoph Hellwig <hch@lst.de>

fs: add CONFIG_BUFFER_HEAD

Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.

For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.

Otherwise this is just Kconfig and ifdef changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 994ec4e2 21-Jun-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove unnecessary pagevec includes

These files no longer need pagevec.h, mostly due to function declarations
being moved out of it.

Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0b52c420 18-Jun-2023 Jan Glauber <jglauber@digitalocean.com>

mm: fix shmem THP counters on migration

The per node numa_stat values for shmem don't change on page migration for
THP:

grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:

shmem N0=1092616192 N1=10485760
shmem_thp N0=1092616192 N1=10485760

migratepages 9181 0 1:

shmem N0=0 N1=1103101952
shmem_thp N0=1092616192 N1=10485760

Fix that by updating shmem_thp counters likewise to shmem counters on page
migration.

[jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge]
Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com
Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c33c7948 12-Jun-2023 Ryan Roberts <ryan.roberts@arm.com>

mm: ptep_get() conversion

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 04dee9e8 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm/various: give up if pte_offset_map[_lock]() fails

Following the examples of nearby code, various functions can just give up
if pte_offset_map() or pte_offset_map_lock() fails. And there's no need
for a preliminary pmd_trans_unstable() or other such check, since such
cases are now safely handled inside.

Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0cb8fd4d 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm/migrate: remove cruft from migration_entry_wait()s

migration_entry_wait_on_locked() does not need to take a mapped pte
pointer, its callers can do the unmap first. Annotate it with
__releases(ptl) to reduce sparse warnings.

Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold
__migration_entry_wait() into migration_entry_wait(), preferring the
tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().

Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4e096ae1 12-May-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert migrate_pages() to work on folios

Almost all of the callers & implementors of migrate_pages() were already
converted to use folios. compaction_alloc() & compaction_free() are
trivial to convert a part of this patch and not worth splitting out.

Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 124abced 09-May-2023 Huang Ying <ying.huang@intel.com>

migrate_pages_batch: simplify retrying and failure counting of large folios

After recent changes to the retrying and failure counting in
migrate_pages_batch(), it was found that it's unnecessary to count
retrying and failure for normal, large, and THP folios separately.
Because we don't use retrying and failure number of large folios directly.
So, in this patch, we simplified retrying and failure counting of large
folios via counting retrying and failure of normal and large folios
together. This results in the reduced line number.

Previously, in migrate_pages_batch we need to track whether the source
folio is large/THP before splitting. So is_large is used to cache
folio_test_large() result. Now, we don't need that variable any more
because we don't count retrying and failure of large folios (only counting
that of THP folios). So, in this patch, is_large is removed to simplify
the code.

This is just code cleanup, no functionality changes are expected.

Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4bb6dc79 28-Apr-2023 Douglas Anderson <dianders@chromium.org>

migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT

The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
finish quickly but not for things that will take a long time. Exactly how
long is too long is not well defined, but waits of tens of milliseconds is
likely non-ideal.

When putting a Chromebook under memory pressure (opening over 90 tabs on a
4GB machine) it was fairly easy to see delays waiting for some locks in
the kcompactd code path of > 100 ms. While the laptop wasn't amazingly
usable in this state, it was still limping along and this state isn't
something artificial. Sometimes we simply end up with a lot of memory
pressure.

Putting the same Chromebook under memory pressure while it was running
Android apps (though not stressing them) showed a much worse result (NOTE:
this was on a older kernel but the codepaths here are similar). Android
apps on ChromeOS currently run from a 128K-block, zlib-compressed,
loopback-mounted squashfs disk. If we get a page fault from something
backed by the squashfs filesystem we could end up holding a folio lock
while reading enough from disk to decompress 128K (and then decompressing
it using the somewhat slow zlib algorithms). That reading goes through
the ext4 subsystem (because it's a loopback mount) before eventually
ending up in the block subsystem. This extra jaunt adds extra overhead.
Without much work I could see cases where we ended up blocked on a folio
lock for over a second. With more extreme memory pressure I could see up
to 25 seconds.

We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
two locks that were seen to be slow [1] and that generated much
discussion. After discussion, it was decided that we should avoid waiting
for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
IO. We'll continue with the unbounded wait for the more full SYNC modes.

With this change, I couldn't see any slow waits on these locks with my
previous testcases.

NOTE: The reason I stated digging into this originally isn't because some
benchmark had gone awry, but because we've received in-the-field crash
reports where we have a hung task waiting on the page lock (which is the
equivalent code path on old kernels). While the root cause of those
crashes is likely unrelated and won't be fixed by this patch, analyzing
those crash reports did point out these very long waits seemed like
something good to fix. With this patch we should no longer hang waiting
on these locks, but presumably the system will still be in a bad shape and
hang somewhere else.

[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid

Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f3ebdf04 18-Apr-2023 David Hildenbrand <david@redhat.com>

mm: don't check VMA write permissions if the PTE/PMD indicates write permissions

Staring at the comment "Recheck VMA as permissions can change since
migration started" in remove_migration_pte() can result in confusion,
because if the source PTE/PMD indicates write permissions, then there
should be no need to check VMA write permissions when restoring migration
entries or PTE-mapping a PMD.

Commit d3cb8bf6081b ("mm: migrate: Close race between migration completion
and mprotect") introduced the maybe_mkwrite() handling in
remove_migration_pte() in 2014, stating that a race between mprotect() and
migration finishing would be possible, and that we could end up with a
writable PTE that should be readable.

However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
and then walks the page tables to (a) set all present writable PTEs to
read-only and (b) convert all writable migration entries to readable
migration entries. While walking the page tables and modifying the
entries, migration code has to grab the PT locks to synchronize against
concurrent page table modifications.

Assuming migration would find a writable migration entry (while holding
the PT lock) and replace it with a writable present PTE, surely mprotect()
code didn't stumble over the writable migration entry yet (converting it
into a readable migration entry) and would instead wait for the PT lock to
convert the now present writable PTE into a read-only PTE. As mprotect()
didn't finish yet, the behavior is just like migration didn't happen: a
writable PTE will be converted to a read-only PTE.

So it's fine to rely on the writability information in the source PTE/PMD
and not recheck against the VMA as long as we're holding the PT lock to
synchronize with anyone who concurrently wants to downgrade write
permissions (like mprotect()) by first adjusting vma->vm_flags /
vma->vm_page_prot to then walk over the page tables to adjust the page
table entries.

Running test cases that should reveal such races -- mprotect(PROT_READ)
racing with page migration or THP splitting -- for multiple hours did not
reveal an issue with this cleanup.

Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 851ae642 16-Apr-2023 Huang Ying <ying.huang@intel.com>

migrate_pages_batch: fix statistics for longterm pin retry

In commit fd4a7ac32918 ("mm: migrate: try again if THP split is failed due
to page refcnt"), if the THP splitting fails due to page reference count,
we will retry to improve migration successful rate. But the failed
splitting is counted as migration failure and migration retry, which will
cause duplicated failure counting. So, in this patch, this is fixed via
undoing the failure counting if we decide to retry. The patch is tested
via failure injection.

Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
Fixes: fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3c811f78 11-Apr-2023 David Hildenbrand <david@redhat.com>

mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"

This reverts commit 96a9c287e25d ("mm/migrate: fix wrongly apply write bit
after mkdirty on sparc64").

Now that sparc64 mkdirty handling is fixed and no longer sets a PTE/PMD
writable that shouldn't be writable, let's revert the temporary fix.

The mkdirty mm selftest still passes with this change on sparc64.

Note that loongarch handling was fixed in commit bf2f34a506e6 ("LoongArch:
Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()").

Link: https://lkml.kernel.org/r/20230411142512.438404-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1da28f1b 01-Mar-2023 Anshuman Khandual <anshuman.khandual@arm.com>

mm/migrate: drop pte_mkhuge() in remove_migration_pte()

Since the following commit, arch_make_huge_pte() should be used directly
in generic memory subsystem as a platform provided page table helper,
instead of pte_mkhuge(). This just drops pte_mkhuge() from
remove_migration_pte(), which has now become redundant.

'commit 16785bd77431 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")'

Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 428e106a 12-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: Introduce untagged_addr_remote()

untagged_addr() removes tags/metadata from the address and brings it to
the canonical form. The helper is implemented on arm64 and sparc. Both of
them do untagging based on global rules.

However, Linear Address Masking (LAM) on x86 introduces per-process
settings for untagging. As a result, untagged_addr() is now only
suitable for untagging addresses for the current proccess.

The new helper untagged_addr_remote() has to be used when the address
targets remote process. It requires the mmap lock for target mm to be
taken.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com


# 2ef7dbb2 02-Mar-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: try migrate in batch asynchronously firstly

When we have locked more than one folios, we cannot wait the lock or bit
(e.g., page lock, buffer head lock, writeback bit) synchronously.
Otherwise deadlock may be triggered. This make it hard to batch the
synchronous migration directly.

This patch re-enables batching synchronous migration via trying to migrate
in batch asynchronously firstly. And any folios that are failed to be
migrated asynchronously will be migrated synchronously one by one.

Test shows that this can restore the TLB flushing batching performance for
synchronous migration effectively.

Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a21d2133 02-Mar-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: move split folios processing out of migrate_pages_batch()

To simplify the code logic and reduce the line number.

Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fb3592c4 02-Mar-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: fix deadlock in batched migration

Patch series "migrate_pages: fix deadlock in batched synchronous
migration", v2.

Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei. Analysis shows that if we have locked some other
folios except the one we are migrating, it's not safe in general to wait
synchronously, for example, to wait the writeback to complete or wait to
lock the buffer head.

So 1/3 fixes the deadlock in a simple way, where the batching support for
the synchronous migration is disabled. The change is straightforward and
easy to be understood. While 3/3 re-introduce the batching for
synchronous migration via trying to migrate asynchronously in batch
optimistically, then fall back to migrate synchronously one by one for
fail-to-migrate folios. Test shows that this can restore the TLB flushing
batching performance for synchronous migration effectively.


This patch (of 3):

Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei! For example, in the following deadlock trace
snippet,

INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
Workqueue: loop4 loop_rootcg_workfn
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? __pfx_wake_page_function+0x10/0x10
__filemap_get_folio+0x73d/0x770
shmem_get_folio_gfp+0x1fd/0xc80
shmem_write_begin+0x91/0x220
generic_perform_write+0x10e/0x2e0
__generic_file_write_iter+0x17e/0x290
? generic_write_checks+0x12b/0x1a0
generic_file_write_iter+0x97/0x180
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_readv_writev+0x13c/0x210
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_write+0xf6/0x330
vfs_iter_write+0x46/0x70
loop_process_work+0x723/0xfe0
loop_rootcg_workfn+0x28/0x40
process_one_work+0x3cc/0x8d0
worker_thread+0x66/0x630
? __pfx_worker_thread+0x10/0x10
kthread+0x153/0x190
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>

INFO: task repro:1023 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? compaction_alloc+0x77/0x1150
? __pfx_wake_page_function+0x10/0x10
folio_wait_bit+0x30/0x40
folio_wait_writeback+0x2e/0x1e0
migrate_pages_batch+0x555/0x1ac0
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __this_cpu_preempt_check+0x17/0x20
? lock_is_held_type+0xe6/0x140
migrate_pages+0x100e/0x1180
? __pfx_compaction_free+0x10/0x10
? __pfx_compaction_alloc+0x10/0x10
compact_zone+0xe10/0x1b50
? lock_is_held_type+0xe6/0x140
? check_preemption_disabled+0x80/0xf0
compact_node+0xa3/0x100
? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
? _find_first_bit+0x7b/0x90
sysctl_compaction_handler+0x5d/0xb0
proc_sys_call_handler+0x29d/0x420
proc_sys_write+0x2b/0x40
vfs_write+0x3a3/0x780
ksys_write+0xb7/0x180
__x64_sys_write+0x26/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f3a2471f59d
RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
</TASK>

The page migration task has held the lock of the shmem folio A, and is
waiting the writeback of the folio B of the file system on the loop block
device to complete. While the loop worker task which writes back the
folio B is waiting to lock the shmem folio A, because the folio A backs
the folio B in the loop device. Thus deadlock is triggered.

In general, if we have locked some other folios except the one we are
migrating, it's not safe to wait synchronously, for example, to wait the
writeback to complete or wait to lock the buffer head.

To fix the deadlock, in this patch, we avoid to batch the page migration
except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting
is avoided.

The fix can be improved further. We will do that as soon as possible.

Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e77d587a 04-Mar-2023 Linus Torvalds <torvalds@linux-foundation.org>

mm: avoid gcc complaint about pointer casting

The migration code ends up temporarily stashing information of the wrong
type in unused fields of the newly allocated destination folio. That
all works fine, but gcc does complain about the pointer type mis-use:

mm/migrate.c: In function ‘__migrate_folio_extract’:
mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’

1050 | *anon_vmap = (void *)dst->mapping;
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

and gcc is actually right to complain since it really doesn't understand
that this is a very temporary special case where this is ok.

This could be fixed in different ways by just obfuscating the assignment
sufficiently that gcc doesn't see what is going on, but the truly
"proper C" way to do this is by explicitly using a union.

Using unions for type conversions like this is normally hugely ugly and
syntactically nasty, but this really is one of the few cases where we
want to make it clear that we're not doing type conversion, we're really
re-using the value bit-for-bit just using another type.

IOW, this should not become a common pattern, but in this one case using
that odd union is probably the best way to document to the compiler what
is conceptually going on here.

[ Side note: there are valid cases where we convert pointers to other
pointer types, notably the whole "folio vs page" situation, where the
types actually have fundamental commonalities.

The fact that the gcc note is limited to just randomized structures
means that we don't see equivalent warnings for those cases, but it
migth also mean that we miss other cases where we do play these kinds
of dodgy games, and this kind of explicit conversion might be a good
idea. ]

I verified that at least for an allmodconfig build on x86-64, this
generates the exact same code, apart from line numbers and assembler
comment changes.

Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd775580 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: change to return bool for isolate_movable_page()

Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
will care about the negative return value, thus we can convert the
isolate_movable_page() to return a boolean value to make the code more
clear when checking the movable page isolation state.

No functional changes intended.

[akpm@linux-foundation.org: remove unneeded comment, per Matthew]
Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9747b9e9 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: hugetlb: change to return bool for isolate_hugetlb()

Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
care about the negative value, thus we can convert the isolate_hugetlb()
to return a boolean value to make code more clear when checking the
hugetlb isolation state. Moreover converts 2 users which will consider
the negative value returned by isolate_hugetlb().

No functional changes intended.

[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f7f9c00d 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: change to return bool for isolate_lru_page()

The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
care about the negative error of isolate_lru_page(), except one user in
add_page_for_migration(). So we can convert the isolate_lru_page() to
return a boolean value, which can help to make the code more clear when
checking the return value of isolate_lru_page().

Also convert all users' logic of checking the isolation state.

No functional changes intended.

Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6f7d760e 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: move THP/hugetlb migration support check to simplify code

This is a code cleanup patch, no functionality change is expected. After
the change, the line number reduces especially in the long
migrate_pages_batch().

Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7e12beb8 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: batch flushing TLB

The TLB flushing will cost quite some CPU cycles during the folio
migration in some situations. For example, when migrate a folio of a
process with multiple active threads that run on multiple CPUs. After
batching the _unmap and _move in migrate_pages(), the TLB flushing can be
batched easily with the existing TLB flush batching mechanism. This patch
implements that.

We use the following test case to test the patch.

On a 2-socket Intel server,

- Run pmbench memory accessing benchmark

- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.

With the patch, the TLB flushing IPI reduces 99.1% during the test and the
number of pages migrated successfully per second increases 291.7%.

Haoxin helped to test the patchset on an ARM64 server with 128 cores, 2
NUMA nodes. Test results show that the page migration performance
increases up to 78%.

NOTE: TLB flushing is batched only for normal folios, not for THP folios.
Because the overhead of TLB flushing for THP folios is much lower than
that for normal folios (about 1/512 on x86 platform).

Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ebe75e47 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: share more code between _unmap and _move

This is a code cleanup patch to reduce the duplicated code between the
_unmap and _move stages of migrate_pages(). No functionality change is
expected.

Link: https://lkml.kernel.org/r/20230213123444.155149-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 80562ba0 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: move migrate_folio_unmap()

Just move the position of the functions. There's no any functionality
change. This is to make it easier to review the next patch via putting
code near its position in the next patch.

Link: https://lkml.kernel.org/r/20230213123444.155149-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5dfab109 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: batch _unmap and _move

In this patch the _unmap and _move stage of the folio migration is
batched. That for, previously, it is,

for each folio
_unmap()
_move()

Now, it is,

for each folio
_unmap()
for each folio
_move()

Based on this, we can batch the TLB flushing and use some hardware
accelerator to copy folios between batched _unmap and batched _move
stages.

Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 64c8902e 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: split unmap_and_move() to _unmap() and _move()

This is a preparation patch to batch the folio unmapping and moving.

In this patch, unmap_and_move() is split to migrate_folio_unmap() and
migrate_folio_move(). So, we can batch _unmap() and _move() in different
loops later. To pass some information between unmap and move, the
original unused dst->mapping and dst->private are used.

Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 42012e04 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: restrict number of pages to migrate in batch

This is a preparation patch to batch the folio unmapping and moving for
non-hugetlb folios.

If we had batched the folio unmapping, all folios to be migrated would be
unmapped before copying the contents and flags of the folios. If the
folios that were passed to migrate_pages() were too many in unit of pages,
the execution of the processes would be stopped for too long time, thus
too long latency. For example, migrate_pages() syscall will call
migrate_pages() with all folios of a process. To avoid this possible
issue, in this patch, we restrict the number of pages to be migrated to be
no more than HPAGE_PMD_NR. That is, the influence is at the same level of
THP migration.

Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e5bfff8b 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: separate hugetlb folios migration

This is a preparation patch to batch the folio unmapping and moving for
the non-hugetlb folios. Based on that we can batch the TLB shootdown
during the folio migration and make it possible to use some hardware
accelerator for the folio copying.

In this patch the hugetlb folios and non-hugetlb folios migration is
separated in migrate_pages() to make it easy to change the non-hugetlb
folios migration implementation.

Link: https://lkml.kernel.org/r/20230213123444.155149-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5b855937 13-Feb-2023 Huang Ying <ying.huang@intel.com>

migrate_pages: organize stats with struct migrate_pages_stats

Patch series "migrate_pages(): batch TLB flushing", v5.

Now, migrate_pages() migrates folios one by one, like the fake code as
follows,

for each folio
unmap
flush TLB
copy
restore map

If multiple folios are passed to migrate_pages(), there are opportunities
to batch the TLB flushing and copying. That is, we can change the code to
something as follows,

for each folio
unmap
for each folio
flush TLB
for each folio
copy
for each folio
restore map

The total number of TLB flushing IPI can be reduced considerably. And we
may use some hardware accelerator such as DSA to accelerate the folio
copying.

So in this patch, we refactor the migrate_pages() implementation and
implement the TLB flushing batching. Base on this, hardware accelerated
folio copying can be implemented.

If too many folios are passed to migrate_pages(), in the naive batched
implementation, we may unmap too many folios at the same time. The
possibility for a task to wait for the migrated folios to be mapped again
increases. So the latency may be hurt. To deal with this issue, the max
number of folios be unmapped in batch is restricted to no more than
HPAGE_PMD_NR in the unit of page. That is, the influence is at the same
level of THP migration.

We use the following test to measure the performance impact of the
patchset,

On a 2-socket Intel server,

- Run pmbench memory accessing benchmark

- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.

With the patch, the TLB flushing IPI reduces 99.1% during the test and
the number of pages migrated successfully per second increases 291.7%.

Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
2 NUMA nodes. Test results show that the page migration performance
increases up to 78%.


This patch (of 9):

Define struct migrate_pages_stats to organize the various statistics in
migrate_pages(). This makes it easier to collect and consume the
statistics in multiple functions. This will be needed in the following
patches in the series.

Link: https://lkml.kernel.org/r/20230213123444.155149-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230213123444.155149-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 280d724a 30-Jan-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

mm/migrate: convert putback_movable_pages() to use folios

Removes 6 calls to compound_head(), and replaces putback_movable_page()
with putback_movable_folio() as well.

Link: https://lkml.kernel.org/r/20230130214352.40538-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 19979497 30-Jan-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

mm/migrate: convert isolate_movable_page() to use folios

Removes 6 calls to compound_head() and prepares the function to take in a
folio instead of page argument.

Link: https://lkml.kernel.org/r/20230130214352.40538-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# da707a6d 30-Jan-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

mm/migrate: add folio_movable_ops()

folio_movable_ops() does the same as page_movable_ops() except uses folios
instead of pages. This function will help make folio conversions in
migrate.c more readable.

Link: https://lkml.kernel.org/r/20230130214352.40538-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ea8e72f4 25-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert putback_active_hugepage to take in a folio

Convert putback_active_hugepage() to folio_putback_active_hugetlb(), this
removes one user of the Huge Page macros which take in a page. The
callers in migrate.c are also cleaned up by being able to directly use the
src and dst folio variables.

Link: https://lkml.kernel.org/r/20230125170537.96973-4-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e37d3e83 13-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert alloc_migrate_huge_page to folios

Change alloc_huge_page_nodemask() to alloc_hugetlb_folio_nodemask() and
alloc_migrate_huge_page() to alloc_migrate_hugetlb_folio(). Both
functions now return a folio rather than a page.

Link: https://lkml.kernel.org/r/20230113223057.173292-7-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6aa3a920 13-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert isolate_hugetlb to folios

Patch series "continue hugetlb folio conversion", v3.

This series continues the conversion of core hugetlb functions to use
folios. This series converts many helper funtions in the hugetlb fault
path. This is in preparation for another series to convert the hugetlb
fault code paths to operate on folios.


This patch (of 8):

Convert isolate_hugetlb() to take in a folio and convert its callers to
pass a folio. Use page_folio() to convert the callers to use a folio is
safe as isolate_hugetlb() operates on a head page.

Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 96f97c43 11-Jan-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm: mlock: update the interface to use folios

Update the mlock interface to accept folios rather than pages, bringing
the interface in line with the internal implementation.

munlock_vma_page() still requires a page_folio() conversion, however this
is consistent with the existent mlock_vma_page() implementation and a
product of rmap still dealing in pages rather than folios.

Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fcd48540 16-Dec-2022 Peter Xu <peterx@redhat.com>

mm/hugetlb: move swap entry handling into vma lock when faulted

In hugetlb_fault(), there used to have a special path to handle swap entry
at the entrance using huge_pte_offset(). That's unsafe because
huge_pte_offset() for a pmd sharable range can access freed pgtables if
without any lock to protect the pgtable from being freed after pmd
unshare.

Here the simplest solution to make it safe is to move the swap handling to
be after the vma lock being held. We may need to take the fault mutex on
either migration or hwpoison entries now (also the vma lock, but that's
really needed), however neither of them is hot path.

Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
need to be protected by the vma lock. Modify migration_entry_wait_huge()
so that it must be called with vma read lock held, and properly release
the lock in __migration_entry_wait_huge().

Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 96a9c287 16-Feb-2023 Peter Xu <peterx@redhat.com>

mm/migrate: fix wrongly apply write bit after mkdirty on sparc64

Nick Bowler reported another sparc64 breakage after the young/dirty
persistent work for page migration (per "Link:" below). That's after a
similar report [2].

It turns out page migration was overlooked, and it wasn't failing before
because page migration was not enabled in the initial report test
environment.

David proposed another way [2] to fix this from sparc64 side, but that
patch didn't land somehow. Neither did I check whether there's any other
arch that has similar issues.

Let's fix it for now as simple as moving the write bit handling to be
after dirty, like what we did before.

Note: this is based on mm-unstable, because the breakage was since 6.1 and
we're at a very late stage of 6.2 (-rc8), so I assume for this specific
case we should target this at 6.3.

[1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/
[2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/

Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com
Fixes: 2e3468778dbe ("mm: remember young/dirty bit for page migrations")
Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Nick Bowler <nbowler@draconx.ca>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Nick Bowler <nbowler@draconx.ca>
Cc: <regressions@lists.linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e26355e2 06-Dec-2022 Jan Kara <jack@suse.cz>

mm: export buffer_migrate_folio_norefs()

Ext4 needs this function to allow safe migration for journalled data
pages.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221207112722.22220-11-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4c74b65f 15-Nov-2022 Yang Li <yang.lee@linux.alibaba.com>

mm/migrate.c: stop using 0 as NULL pointer

mm/migrate.c:1198:24: warning: Using plain integer as NULL pointer

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3080
Link: https://lkml.kernel.org/r/20221116012345.84870-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# eaec4e63 08-Nov-2022 Huang Ying <ying.huang@intel.com>

migrate: convert migrate_pages() to use folios

Quite straightforward, the page functions are converted to corresponding
folio functions. Same for comments.

THP specific code are converted to be large folio.

Link: https://lkml.kernel.org/r/20221109012348.93849-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 49f51859 08-Nov-2022 Huang Ying <ying.huang@intel.com>

migrate: convert unmap_and_move() to use folios

Patch series "migrate: convert migrate_pages()/unmap_and_move() to use
folios", v2.

The conversion is quite straightforward, just replace the page API to the
corresponding folio API. migrate_pages() and unmap_and_move() mostly work
with folios (head pages) only.


This patch (of 2):

Quite straightforward, the page functions are converted to corresponding
folio functions. Same for comments.

Link: https://lkml.kernel.org/r/20221109012348.93849-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20221109012348.93849-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 16fd6b31 09-Nov-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

Revert "mm: migration: fix the FOLL_GET failure on following huge page"

Revert commit 831568214883 ("mm: migration: fix the FOLL_GET failure on
following huge page"), since after commit 1a6baaa0db73 ("s390/hugetlb:
switch to generic version of follow_huge_pud()") and commit 57a196a58421
("hugetlb: simplify hugetlb handling in follow_page_mask") were merged,
now all the following huge page routines can support FOLL_GET operation.

Link: https://lkml.kernel.org/r/496786039852aba90ffa68f10d0df3f4236a990b.1667983080.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 345c62d1 01-Nov-2022 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert move_hugetlb_state() to folios

Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to
take in folios.

[akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build]
Link: https://lkml.kernel.org/r/20221101223059.460937-10-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Bui Quang Minh <minhquangbui99@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8b881763 04-Nov-2022 Vlastimil Babka <vbabka@suse.cz>

mm/migrate: make isolate_movable_page() skip slab pages

In the next commit we want to rearrange struct slab fields to allow a larger
rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct
list_head slab_list", where the value of prev pointer can become LIST_POISON2,
which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can
confuse PageMovable() to be a false positive and cause a GPF as reported by lkp
[1].

To fix this, make isolate_movable_page() skip pages with the PageSlab flag set.
This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page
allocation and freeing, and their counterparts to isolate_movable_page().

Based on my RFC from [2]. Added a comment update from Matthew's variant in [3]
and, as done there, moved the PageSlab checks to happen before trying to take
the page lock.

[1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/
[2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/
[3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/

Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>


# fd4a7ac3 24-Oct-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: try again if THP split is failed due to page refcnt

When creating a virtual machine, we will use memfd_create() to get a file
descriptor which can be used to create share memory mappings using the
mmap function, meanwhile the mmap() will set the MAP_POPULATE flag to
allocate physical pages for the virtual machine.

When allocating physical pages for the guest, the host can fallback to
allocate some CMA pages for the guest when over half of the zone's free
memory is in the CMA area.

In guest os, when the application wants to do some data transaction with
DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
create IOMMU mappings for the DMA pages. However, when calling
VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
failed to longterm-pin sometimes.

After some invetigation, we found the pages used to do DMA mapping can
contain some CMA pages, and these CMA pages will cause a possible failure
of the longterm-pin, due to failed to migrate the CMA pages. The reason
of migration failure may be temporary reference count or memory allocation
failure. So that will cause the VFIO_IOMMU_MAP_DMA ioctl returns error,
which makes the application failed to start.

I observed one migration failure case (which is not easy to reproduce) is
that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
count is also 1.

That means when migrating a THP which is in CMA area, but can not allocate
a new THP due to memory fragmentation, so it will split the THP. However
THP split is also failed, probably the reason is temporary reference count
of this THP. And the temporary reference count can be caused by dropping
page caches (I observed the drop caches operation in the system), but we
can not drop the shmem page caches due to they are already dirty at that
time.

Especially for THP split failure, which is caused by temporary reference
count, we can try again to mitigate the failure of migration in this case
according to previous discussion [1].

[1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
Link: https://lkml.kernel.org/r/6784730480a1df82e8f4cba1ed088e4ac767994b.1666599848.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e51da3a9 22-Sep-2022 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: add folio_hstate()

Helper function to retrieve hstate information from a hugetlb folio.

Link: https://lkml.kernel.org/r/20220922154207.1575343-6-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Cross <ccross@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 03e5f82e 24-Oct-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: fix return value if all subpages of THPs are migrated successfully

During THP migration, if THPs are not migrated but they are split and all
subpages are migrated successfully, migrate_pages() will still return the
number of THP pages that were not migrated. This will confuse the callers
of migrate_pages(). For example, the longterm pinning will failed though
all pages are migrated successfully.

Thus we should return 0 to indicate that all pages are migrated in this
case

Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com
Fixes: b5bade978e9b ("mm: migrate: fix the return value of migrate_pages()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 16ce101d 28-Sep-2022 Alistair Popple <apopple@nvidia.com>

mm/memory.c: fix race when faulting a device private page

Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting.
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done in for both Nouveau
and hmm-tests though so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap. This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid. It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed. Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail. To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.

[mpe@ellerman.id.au: fix build]
Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 29eea9b5 02-Sep-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert page_get_anon_vma() to folio_get_anon_vma()

With all callers now passing in a folio, rename the function and convert
all callers. Removes a couple of calls to compound_head() and a reference
to page->mapping.

Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c33db292 02-Sep-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

migrate: convert unmap_and_move_huge_page() to use folios

Saves several calls to compound_head() and removes a couple of uses of
page->lru.

Link: https://lkml.kernel.org/r/20220902194653.1739778-52-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 682a71a1 02-Sep-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

migrate: convert __unmap_and_move() to use folios

Removes a lot of calls to compound_head(). Also remove a VM_BUG_ON that
can never trigger as the PageAnon bit is the bottom bit of page->mapping.

Link: https://lkml.kernel.org/r/20220902194653.1739778-51-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f7091ed6 23-Aug-2022 Haiyue Wang <haiyue.wang@intel.com>

mm: fix the handling Non-LRU pages returned by follow_page

The handling Non-LRU pages returned by follow_page() jumps directly, it
doesn't call put_page() to handle the reference count, since 'FOLL_GET'
flag for follow_page() has get_page() called. Fix the zone device page
check by handling the page reference count correctly before returning.

And as David reviewed, "device pages are never PageKsm pages". Drop this
zone device page check for break_ksm().

Since the zone device page can't be a transparent huge page, so drop the
redundant zone device page check for split_huge_pages_pid(). (by Miaohe)

Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 467b171a 18-Aug-2022 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/demotion: update node_is_toptier to work with memory tiers

With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults. Update
node_is_toptier to work with memory tiers. All NUMA nodes are by default
top tier nodes. With lower(slower) memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier

[sj@kernel.org: include missed header file, memory-tiers.h]
Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
[akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
[aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6c542ab7 18-Aug-2022 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/demotion: build demotion targets based on explicit memory tiers

This patch switch the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier and additional memory tiers will be added by drivers
like dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest
node in the immediately following memory tier is used as a demotion
target.

Since we are now only building demotion target for N_MEMORY NUMA nodes the
CPU hotplug calls are removed in this patch.

Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 91952440 18-Aug-2022 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/demotion: move memory demotion related code

This moves memory demotion related code to mm/memory-tiers.c. No
functional change in this patch.

Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7047b5a4 17-Aug-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP

If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will
be split, and the subpages of fail-to-migrate THP will be tried to migrate
again, so we should not account the retry counter in the second loop,
since we already accounted 'nr_thp_failed' in the first loop.

Moreover we also do not need retry 10 times for -EAGAIN case for the
subpages of fail-to-migrate THP in the second loop, since we already
regarded the THP as migration failure, and save some migration time (for
the worst case, will try 512 * 10 times) according to previous discussion
[1].

[1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/

Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 077309bc 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): fix failure counting for retry

After 10 retries, we will give up and the remaining pages will be counted
as failure in nr_failed and nr_thp_failed. We should count the failure in
nr_failed_pages too. This is done in this patch.

Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e6fa8a79 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): fix failure counting for THP splitting

If THP is failed to be migrated, it may be split and retry. But after
splitting, the head page will be left in "from" list, although THP
migration failure has been counted already. If the head page is failed to
be migrated too, the failure will be counted twice incorrectly. So this
is fixed in this patch via moving the head page of THP after splitting to
"thp_split_pages" too.

Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 577be05c 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): fix failure counting for THP on -ENOSYS

If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
unmap_and_move_huge_page() will return -ENOSYS. For THP, splitting will
be tried, but if splitting doesn't succeed, the THP will be left in "from"
list wrongly. If some other pages are retried, the THP migration failure
will counted again. This is fixed via moving the failure THP from "from"
to "ret_pages".

Another issue of the original code is that the unsupported failure
processing isn't consistent between THP and hugetlbfs page. Make them
consistent in this patch to make the code easier to be understood too.

Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5fc30916 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): fix failure counting for THP subpages retrying

If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be
split into thp_split_pages, and after other pages are migrated, pages in
thp_split_pages will be migrated with no_subpage_counting == true, because
its failure have been counted already. If some pages in thp_split_pages
are retried during migration, we should not count their failure if
no_subpage_counting == true too. This is done this patch to fix the
failure counting for THP subpages retrying.

Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fbed53b4 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): fix THP failure counting for -ENOMEM

In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
returned, and migrate_pages() will try to split the THP unless "reason" is
MR_NUMA_MISPLACED (that is, nosplit == true). But when nosplit == true,
the THP migration failure will not be counted.

This is incorrect, so in this patch, the THP migration failure will be
counted for -ENOMEM regardless of nosplit is true or false. The nr_failed
counting isn't fixed because it's not used. Added some comments for it
per Baolin's suggestion.

Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9c62ff00 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate_pages(): remove unnecessary list_safe_reset_next()

Before commit b5bade978e9b ("mm: migrate: fix the return value of
migrate_pages()"), the tail pages of THP will be put in the "from"
list directly. So one of the loop cursors (page2) needs to be reset,
as is done in try_split_thp() via list_safe_reset_next(). But after
the commit, the tail pages of THP will be put in a dedicated
list (thp_split_pages). That is, the "from" list will not be changed
during splitting. So, it's unnecessary to call list_safe_reset_next()
anymore.

This is a code cleanup, no functionality changes are expected.

Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a7504ed1 17-Aug-2022 Huang Ying <ying.huang@intel.com>

migrate: fix syscall move_pages() return value for failure

Patch series "migrate_pages(): fix several bugs in error path", v3.

During review the code of migrate_pages() and build a test program for
it. Several bugs in error path are identified and fixed in this
series.

Most patches are tested via

- Apply error-inject.patch in Linux kernel
- Compile test-migrate.c (with -lnuma)
- Test with test-migrate.sh

error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
It turns out that error injection is an important tool to fix bugs in
error path.


This patch (of 8):

The return value of move_pages() syscall is incorrect when counting
the remaining pages to be migrated. For example, for the following
test program,

"
#define _GNU_SOURCE

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>

#include <fcntl.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <numaif.h>
#include <numa.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* free pages only if memory pressure */
#endif

#define ONE_MB (1024 * 1024)
#define MAP_SIZE (16 * ONE_MB)
#define THP_SIZE (2 * ONE_MB)
#define THP_MASK (THP_SIZE - 1)

#define ERR_EXIT_ON(cond, msg) \
do { \
int __cond_in_macro = (cond); \
if (__cond_in_macro) \
error_exit(__cond_in_macro, (msg)); \
} while (0)

void error_msg(int ret, int nr, int *status, const char *msg)
{
int i;

fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
msg, ret, strerror(errno));

if (!nr)
return;
fprintf(stderr, "status: ");
for (i = 0; i < nr; i++)
fprintf(stderr, "%d ", status[i]);
fprintf(stderr, "\n");
}

void error_exit(int ret, const char *msg)
{
error_msg(ret, 0, NULL, msg);
exit(1);
}

int page_size;

bool do_vmsplice;
bool do_thp;

static int pipe_fds[2];
void *addr;
char *pn;
char *pn1;
void *pages[2];
int status[2];

void prepare()
{
int ret;
struct iovec iov;

if (addr) {
munmap(addr, MAP_SIZE);
close(pipe_fds[0]);
close(pipe_fds[1]);
}

ret = pipe(pipe_fds);
ERR_EXIT_ON(ret, "pipe");

addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
if (do_thp) {
ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
ERR_EXIT_ON(ret, "advise hugepage");
}

pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
pn1 = pn + THP_SIZE;
pages[0] = pn;
pages[1] = pn1;
*pn = 1;

if (do_vmsplice) {
iov.iov_base = pn;
iov.iov_len = page_size;
ret = vmsplice(pipe_fds[1], &iov, 1, 0);
ERR_EXIT_ON(ret < 0, "vmsplice");
}

status[0] = status[1] = 1024;
}

void test_migrate()
{
int ret;
int nodes[2] = { 1, 1 };
pid_t pid = getpid();

prepare();
ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 1, status, "move 1 page");

prepare();
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");

prepare();
*pn1 = 1;
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages");

prepare();
*pn1 = 1;
nodes[1] = 0;
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
}

int main(int argc, char *argv[])
{
numa_run_on_node(0);
page_size = getpagesize();

test_migrate();

fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
do_vmsplice = true;
test_migrate();

fprintf(stderr, "\nTest THP:\n");
do_thp = true;
do_vmsplice = false;
test_migrate();

fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
do_vmsplice = true;
test_migrate();

return 0;
}
"

The output of the current kernel is,

"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0

Make page 0 cannot be migrated:
Error: move 1 page, ret : 0, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 0, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
status: 1024 1024
"

While the expected output is,

"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0

Make page 0 cannot be migrated:
Error: move 1 page, ret : 1, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 1, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
status: 1024 1024
"

Fix this via correcting the remaining pages counting. With the fix,
the output for the test program as above is expected.

Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
Fixes: 5984fabb6e82 ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2e346877 10-Aug-2022 Peter Xu <peterx@redhat.com>

mm: remember young/dirty bit for page migrations

When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, and marking the page as old in the new page table
using either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine from functional-wise, but that's not friendly to page reclaim
because the moving page can be actively accessed within the procedure.
Not to mention hardware setting the young bit can bring quite some
overhead on some systems, e.g. x86_64 needs a few hundreds nanoseconds to
set the bit. The same slowdown problem to dirty bits when the memory is
first written after page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated. To achieve it, define a new set
of bits in the migration swap offset field to cache the A/D bits for old
pte. Then when removing/recovering the migration entry, we can recover
the A/D bits even if the page changed.

One thing to mention is that here we used max_swapfile_size() to detect
how many swp offset bits we have, and we'll only enable this feature if we
know the swp offset is big enough to store both the PFN value and the A/D
bits. Otherwise the A/D bits are dropped like before.

Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 33024536 13-Jul-2022 Huang Ying <ying.huang@intel.com>

memory tiering: hot page selection with hint page fault latency

Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified.
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote. But this isn't a perfect
algorithm to identify the hot pages. Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g. 60 seconds). So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault. Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting. But
the workload latency will be better. This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent. So in this patchset, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed. The test results are
as follows,

pmbench score promote rate
(accesses/s) MB/s
------------- ------------
base 146887704.1 725.6
hot selection 165695601.2 544.0
rate limit 162814569.8 165.2
auto adjustment 170495294.0 136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified. Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote. But this isn't a perfect algorithm to
identify the hot pages. Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g. 60 seconds). The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm.
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
to be PROT_NONE, the current time is recorded in struct page as scan
time.

- When the page is accessed, hint page fault will occur. The scan
time is gotten from the struct page. And The hint page fault
latency is defined as

hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher. So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time.
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth. But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node. So the accessing CPU and PID information are unnecessary for the
slow memory pages. We can reuse these bits in struct page to record the
scan time. For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test. All pages with hint page fault latency < hot
threshold will be considered hot.

It's hard for users to determine the hot threshold. So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment. We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer. For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered. But the hint page fault
latency isn't shorter than the hot threshold. So the pages will
not be promoted.

- When the memory area is scanned again, maybe after a scan period,
the hint page fault latency measured will be shorter than the hot
threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 83156821 12-Aug-2022 Haiyue Wang <haiyue.wang@intel.com>

mm: migration: fix the FOLL_GET failure on following huge page

Not all huge page APIs support FOLL_GET option, so move_pages() syscall
will fail to get the page node information for some huge pages.

Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will
return NULL page for FOLL_GET when calling move_pages() syscall with the
NULL 'nodes' parameter, the 'status' parameter has '-2' error in array.

Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

But these huge page APIs don't support FOLL_GET:
1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
It will cause WARN_ON_ONCE for FOLL_GET.
3. follow_huge_pgd() in mm/hugetlb.c

This is an temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set for huge pages.

After supporting follow huge page by FOLL_GET is done, this fix can be
reverted safely.

Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3218f871 15-Jul-2022 Alex Sierra <alex.sierra@amd.com>

mm: handling Non-LRU pages returned by vm_normal_pages

With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page, and for COW.
They do not support LRU lists, NUMA migration or THP.

Callers to follow_page() currently don't expect ZONE_DEVICE pages,
however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for
ZONE_DEVICE pages in applicable users of follow_page() as well.

Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> [v2]
Reviewed-by: Alistair Popple <apopple@nvidia.com> [v6]
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ad1ac596 30-May-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: fix potential pte_unmap on an not mapped pte

__migration_entry_wait and migration_entry_wait_on_locked assume pte is
always mapped from caller. But this is not the case when it's called from
migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.

Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
Fixes: 30dad30922cc ("mm: migration: add migrate_entry_wait_huge()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7ce82f4c 30-May-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: return errno when isolate_huge_page failed

We might fail to isolate huge page due to e.g. the page is under
migration which cleared HPageMigratable. We should return errno in this
case rather than always return 1 which could confuse the user, i.e. the
caller might think all of the memory is migrated while the hugetlb page is
left behind. We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.

Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 160088b3 30-May-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: remove unneeded lock page and PageMovable check

When non-lru movable page was freed from under us, __ClearPageMovable must
have been done. So we can remove unneeded lock page and PageMovable check
here. Also free_pages_prepare() will clear PG_isolated for us, so we can
further remove ClearPageIsolated as suggested by David.

Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9d0ddc0c 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops->migratepage()

With all users converted to migrate_folio(), remove this operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# b890ec2a 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

hugetlb: Convert to migrate_folio

This involves converting migrate_huge_page_move_mapping(). We also need a
folio variant of hugetlb_set_page_subpool(), but that's for a later patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>


# 2ec810d5 05-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Add filemap_migrate_folio()

There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 54184650 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert migrate_page() to migrate_folio()

Convert all callers to pass a folio. Most have the folio
already available. Switch all users from aops->migratepage to
aops->migrate_folio. Also turn the documentation into kerneldoc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>


# 108ca835 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert expected_page_refs() to folio_expected_refs()

Now that both callers have a folio, convert this function to
take a folio & rename it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 67235182 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()

Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 2be7fa10 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert writeout() to take a folio

Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 8faa8ef5 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()

Use a folio throughout. migrate_page() will be converted to
migrate_folio() later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 5490da4f 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add aops->migrate_folio

Provide a folio-based replacement for aops->migratepage. Update the
documentation to document migrate_folio instead of migratepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 68f2736a 07-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Convert all PageMovable users to movable_operations

These drivers are rather uncomfortably hammered into the
address_space_operations hole. They aren't filesystems and don't behave
like filesystems. They just need their own movable_operations structure,
which we can point to directly from page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# b653db77 19-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Clear page->private when splitting or migrating a page

In our efforts to remove uses of PG_private, we have found folios with
the private flag clear and folio->private not-NULL. That is the root
cause behind 642d51fb0775 ("ceph: check folio PG_private bit instead
of folio->private"). It can also affect a few other filesystems that
haven't yet reported a problem.

compaction_alloc() can return a page with uninitialised page->private,
and rather than checking all the callers of migrate_pages(), just zero
page->private after calling get_new_page(). Similarly, the tail pages
from split_huge_page() may also have an uninitialised page->private.

Reported-by: Xiubo Li <xiubli@redhat.com>
Tested-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# e7e3ffeb 12-May-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: convert move_to_new_page() into move_to_new_folio()

Pass in the folios that we already have in each caller. Saves a
lot of calls to compound_head().

Link: https://lkml.kernel.org/r/20220504182857.4013401-27-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 717aeab4 12-May-2022 Jagdish Gediya <jvgediya@linux.ibm.com>

mm: convert sysfs input to bool using kstrtobool()

Sysfs input conversion to corrosponding bool value e.g. "false" or "0" to
false, "true" or "1" to true are currently handled through strncmp at
multiple places. Use kstrtobool() to convert sysfs input to bool value.

[akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy]
Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.com
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6c287605 09-May-2022 David Hildenbrand <david@redhat.com>

mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the page table
entry gets write-protected.

With that information at hand, we can extend our COW logic to always reuse
anonymous pages that are exclusive. For anonymous pages that might be
shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought, some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page
tables exclusively in a single process -- multiple page table locks apply.
Further, due to MADV_WIPEONFORK we might not necessarily write-protect
all PTEs, and only some subpages might be pinned. Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and not having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while
it's for THP actually "the part of an anonymous folio referenced via a
page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let the
anon rmap code to handle all PG_anon_exclusive logic it can easily handle.

If a writable, present page table entry points at an anonymous (sub)page,
that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliably
pin (FOLL_PIN) on an anonymous page references via a present page table
entry, it must only pin if PG_anon_exclusive is set for the mapped
(sub)page.

This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE, follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:

* For fork(), this means having to copy the page and not being able to
share it. fork() protects against concurrent GUP using the PT lock and
the src_mm->write_protect_seq.

* For KSM, this means sharing will fail. For swap this means, unmapping
will fail, For migration this means, migration will fail early. All
three cases protect against concurrent GUP using the PT lock and a
proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the successive write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if
fork() is *not* involved, however swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP
users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism fresh anonymous pages come into life (besides migration code
where we copy the page->mapping), all fresh anonymous pages will start out
as exclusive.

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
pages are a scarce resource.

I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at shared anonymous pages.
For readable migration entries that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information
about exlusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. For now, information about exclusivity is lost. In
the future, we might want to remember that information in the swap entry
in some cases, however, it requires more thought, care, and a way to store
that information in swap entries.

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.

a) In case we have a readable-exclusive PMD migration entry, simply
insert readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze
("convert to migration entries"), simply forward PG_anon_exclusive to
all sub-pages, no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it
similar to try_to_migrate(): try marking the page shared first. In
case we fail, we ignore the "freeze" instruction and simply split
ordinarily. try_to_migrate() will properly fail because the THP is
still mapped via PTEs.

When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.

I.7. fork() handling fork() handling is relatively easy, because
PG_anon_exclusive is only expressive for some page table entry types.

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which will
fail if the page is pinned. If it failed, we have to copy (or PTE-map a
PMD to handle it on the PTE level).

Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table and handle it just like present anonymous pages.

b) Device private entry

Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned.

page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
fail because they cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable entries:
possibly shared.

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive
migration entry:

When write-protecting a migration entry that points at an anonymous page,
remember the information about exclusivity via the "readable-exclusive"
migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper clear/invalidate+flush
to make the following scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins
and marks the page possibly shared.

2. GUP-fast pins the page due to lack of synchronization

3. fork() converts the "writable/readable-exclusive" migration entry into a
readable migration entry

4. Migration fails due to the GUP pin (failing to freeze the refcount)

5. Migration entries are restored. PG_anon_exclusive is lost

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly shared
and synchronize against GUP-fast by a proper clear/invalidate+flush to
make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and
clears exclusivity information on the page.

2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II would
apply. This is future work.

Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 28c5209d 09-May-2022 David Hildenbrand <david@redhat.com>

mm/rmap: pass rmap flags to hugepage_add_anon_rmap()

Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for
page_add_anon_rmap() now. RMAP_COMPOUND is implicit for hugetlb pages and
ignored.

Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f1e2db12 09-May-2022 David Hildenbrand <david@redhat.com>

mm/rmap: remove do_page_add_anon_rmap()

... and instead convert page_add_anon_rmap() to accept flags.

Passing flags instead of bools is usually nicer either way, and we want to
more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
that an anonymous page is exclusive: for example, when restoring an
anonymous page from a writable migration entry.

This is a preparation for marking an anonymous page inside
page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.

Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fb3d824d 09-May-2022 David Hildenbrand <david@redhat.com>

mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()

... and move the special check for pinned pages into
page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
via a new pageflag, clearing it only after making sure that there are no
GUP pins on the anonymous page.

We really only care about pins on anonymous pages, because they are prone
to getting replaced in the COW handler once mapped R/O. For !anon pages
in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
that, at least not that I could come up with an example.

Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
know we're dealing with anonymous pages. Also, drop the handling of
pinned pages from copy_huge_pud() and add a comment if ever supporting
anonymous pages on the PUD level.

This is a preparation for tracking exclusivity of anonymous pages in the
rmap code, and disallowing marking a page shared (-> failing to duplicate)
if there are GUP pins on a page.

Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7d6e2d96 29-Apr-2022 Oscar Salvador <osalvador@suse.de>

mm: untangle config dependencies for demote-on-reclaim

At the time demote-on-reclaim was introduced, it was tied to
CONFIG_HOTPLUG_CPU + CONFIG_MIGRATE, but that is not really accurate.

The only two things we need to depend on are CONFIG_NUMA + CONFIG_MIGRATE,
so clean this up. Furthermore, we only register the hotplug memory
notifier when the system has CONFIG_MEMORY_HOTPLUG.

Link: https://lkml.kernel.org/r/20220322224016.4574-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9c42fe4e 29-Apr-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: simplify the refcount validation when migrating hugetlb mapping

There is no need to validate the hugetlb page's refcount before trying to
freeze the hugetlb page's expected refcount, instead we can just rely on
the page_ref_freeze() to simplify the validation.

Moreover we are always under the page lock when migrating the hugetlb page
mapping, which means nowhere else can remove it from the page cache, so we
can remove the xas_load() validation under the i_pages lock.

Link: https://lkml.kernel.org/r/eb2fbbeaef2b1714097b9dec457426d682ee0635.1649676424.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4cd61484 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: fix possible do_pages_stat_array racing with memory offline

When follow_page peeks a page, the page could be migrated and then be
offlined while it's still being used by the do_pages_stat_array(). Use
FOLL_GET to hold the page refcnt to fix this potential race.

Link: https://lkml.kernel.org/r/20220318111709.60311-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3f26c88b 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: fix potential invalid node access for reclaim-based migration

If we failed to setup hotplug state callbacks for mm/demotion:online in
some corner cases, node_demotion will be left uninitialized. Invalid node
might be returned from the next_demotion_node() when doing reclaim-based
migration. Use kcalloc to allocate node_demotion to fix the issue.

Link: https://lkml.kernel.org/r/20220318111709.60311-11-linmiaohe@huawei.com
Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 69a041ff 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: fix potential page refcounts leak in migrate_pages

In -ENOMEM case, there might be some subpages of fail-to-migrate THPs left
in thp_split_pages list. We should move them back to migration list so
that they could be put back to the right list by the caller otherwise the
page refcnt will be leaked here. Also adjust nr_failed and nr_thp_failed
accordingly to make vm events account more accurate.

Link: https://lkml.kernel.org/r/20220318111709.60311-10-linmiaohe@huawei.com
Fixes: b5bade978e9b ("mm: migrate: fix the return value of migrate_pages()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f430893b 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: remove some duplicated codes in migrate_pages

Remove the duplicated codes in migrate_pages to simplify the code. Minor
readability improvement. No functional change intended.

Link: https://lkml.kernel.org/r/20220318111709.60311-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 91925ab8 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: avoid unneeded nodemask_t initialization

Avoid unneeded next_pass and this_pass initialization as they're always
set before using to save possible cpu cycles when there are plenty of
nodes in the system.

Link: https://lkml.kernel.org/r/20220318111709.60311-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3eefb826 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: use helper macro min in do_pages_stat

We could use helper macro min to help set the chunk_nr to simplify the
code.

Link: https://lkml.kernel.org/r/20220318111709.60311-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# cb1c37b1 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: use helper function vma_lookup() in add_page_for_migration

We could use helper function vma_lookup() to lookup the needed vma to
simplify the code.

Link: https://lkml.kernel.org/r/20220318111709.60311-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b75454e1 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: remove unneeded local variable page_lru

We can use page_is_file_lru() directly to help account the isolated pages
to simplify the code a bit.

Link: https://lkml.kernel.org/r/20220318111709.60311-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5202978b 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: remove unneeded local variable mapping_locked

Patch series "A few cleanup and fixup patches for migration", v2.

This series contains a few patches to remove unneeded variables, jump
label and use helper to simplify the code. Also we fix some bugs such as
page refcounts leak , invalid node access and so on. More details can be
found in the respective changelogs.


This patch (of 11):

When mapping_locked is true, TTU_RMAP_LOCKED is always set to ttu. We can
check ttu instead so mapping_locked can be removed. And ttu is either 0
or TTU_RMAP_LOCKED now. Change '|=' to '=' to reflect this.

Link: https://lkml.kernel.org/r/20220318111709.60311-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220318111709.60311-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bc53008e 29-Apr-2022 Wei Yang <richard.weiyang@gmail.com>

mm/vmscan: make sure wakeup_kswapd with managed zone

wakeup_kswapd() only wake up kswapd when the zone is managed.

For two callers of wakeup_kswapd(), they are node perspective.

* wake_all_kswapds
* numamigrate_isolate_page

If we picked up a !managed zone, this is not we expected.

This patch makes sure we pick up a managed zone for wakeup_kswapd(). And
it also use managed_zone in migrate_balanced_pgdat() to get the proper
zone.

[richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 68189fef 30-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Change try_to_free_buffers() to take a folio

All but two of the callers already have a folio; pass a folio into
try_to_free_buffers(). This removes the last user of cancel_dirty_page()
so remove that wrapper function too.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# a04cd160 08-Apr-2022 Zi Yan <ziy@nvidia.com>

mm: migrate: use thp_order instead of HPAGE_PMD_ORDER for new page allocation.

Fix a VM_BUG_ON_FOLIO(folio_nr_pages(old) != nr_pages) crash.

With folios support, it is possible to have other than HPAGE_PMD_ORDER
THPs, in the form of folios, in the system. Use thp_order() to correctly
determine the source page order during migration.

Link: https://lkml.kernel.org/r/20220404165325.1883267-1-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/20220404132908.GA785673@u2004/
Fixes: d68eccad3706 ("mm/filemap: Allow large folios to be added to the page cache")
Reported-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c185e494 06-Jul-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Use a folio in migrate_misplaced_transhuge_page()

Unify alloc_misplaced_dst_page() and alloc_misplaced_dst_page_thp().
Removes an assumption that compound pages are HPAGE_PMD_ORDER.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# ffe06786 04-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Use a folio in alloc_migration_target()

This removes an assumption that a large folio is HPAGE_PMD_ORDER
as well as letting us remove the call to prep_transhuge_page()
and a few hidden calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# adb11e78 01-Apr-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm/munlock: protect the per-CPU pagevec by a local_lock_t

The access to mlock_pvec is protected by disabling preemption via
get_cpu_var() or implicit by having preemption disabled by the caller
(in mlock_page_drain() case). This breaks on PREEMPT_RT since
folio_lruvec_lock_irq() acquires a sleeping lock in this section.

Create struct mlock_pvec which consits of the local_lock_t and the
pagevec. Acquire the local_lock() before accessing the per-CPU pagevec.
Replace mlock_page_drain() with a _local() version which is invoked on
the local CPU and acquires the local_lock_t and a _remote() version
which uses the pagevec from a remote CPU which offline.

Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4cc79b33 24-Mar-2022 Anshuman Khandual <anshuman.khandual@arm.com>

mm/migration: add trace events for base page and HugeTLB migrations

This adds two trace events for base page and HugeTLB page migrations.
These events, closely follow the implementation details like setting and
removing of PTE migration entries, which are essential operations for
migration. The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
<events/migration.h> and <events/tlb.h> based trace events. Hence drop
redundant CREATE_TRACE_POINTS from other places which could have otherwise
conflicted during build.

Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 734c1570 22-Mar-2022 Oscar Salvador <osalvador@suse.de>

mm: only re-generate demotion targets when a numa node changes its N_CPU state

Abhishek reported that after patch [1], hotplug operations are taking
roughly double the expected time. [2]

The reason behind is that the CPU callbacks that
migrate_on_reclaim_init() sets always call set_migration_target_nodes()
whenever a CPU is brought up/down.

But we only care about numa nodes going from having cpus to become
cpuless, and vice versa, as that influences the demotion_target order.

We do already have two CPU callbacks (vmstat_cpu_online() and
vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
callbacks in migrate_on_reclaim_init() and only call
set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
numa node change its N_CPU state.

[1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
[2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/

[osalvador@suse.de: add feedback from Huang Ying]
Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de

Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
Fixes: 884a6e5d1f93b ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reported-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c574bbe9 22-Mar-2022 Huang Ying <ying.huang@intel.com>

NUMA balancing: optimize page placement for memory tiering system

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are usually
different.

In such system, because of the memory accessing pattern changing etc,
some pages in the slow memory may become hot globally. So in this
patch, the NUMA balancing mechanism is enhanced to optimize the page
placement among the different memory types according to hot/cold
dynamically.

In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.

The original NUMA balancing mechanism will stop to migrate pages if the
free memory of the target node becomes below the high watermark. This
is a reasonable policy if there's only one memory type. But this makes
the original NUMA balancing mechanism almost do not work to optimize
page placement among different memory types. Details are as follows.

It's the common cases that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So, there are almost always
no enough free pages in the fast memory nodes, so that the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,

a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.

b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.

The choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is influenced, e.g. the direct reclaiming may be triggered.

The choice "b" works much better at this aspect. If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added. Which is larger than the
high watermark and can be controlled via watermark_scale_factor.

In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to be used to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follow, so that the users can enable/disable these
functionality individually.

The sysctl is converted from a Boolean value to a bits field. The
definition of the flags is,

- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING

We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results shows that the pmbench score can
improve up to 95.9%.

Thanks Andrew Morton to help fix the document format error.

Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e39bb6be 22-Mar-2022 Huang Ying <ying.huang@intel.com>

NUMA Balancing: add page promotion counter

Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are different.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for
use like normal RAM"), the PMEM could be used as the cost-effective
volatile memory in separate NUMA nodes. In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node. The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.

To optimize the system overall performance, the hot pages should be
placed in DRAM node. To do that, we need to identify the hot pages in
the PMEM node and migrate them to DRAM node via NUMA migration.

In the original NUMA balancing, there are already a set of existing
mechanisms to identify the pages recently accessed by the CPUs in a node
and migrate the pages to the node. So we can reuse these mechanisms to
build the mechanisms to optimize the page placement in the memory
tiering system. This is implemented in this patchset.

At the other hand, the cold pages should be placed in PMEM node. So, we
also need to identify the cold pages in the DRAM node and migrate them
to PMEM node.

In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented. Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM node
to accommodate the promoted hot PMEM pages. This is implemented in this
patchset too.

We have tested the solution with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent Memory
Model. The test results shows that the pmbench score can improve up to
95.9%.

This patch (of 3):

In a system with multiple memory types, e.g. DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node as described in the
description of the commit c221c0b0308f ("device-dax: "Hotplug"
persistent memory for use like normal RAM"). So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.

To distinguish the number of the inter-type promoted pages from that of
the inter-socket migrated pages. A new vmstat count is added. The
counter is per-node (count in the target node). So this can be used to
identify promotion imbalance among the NUMA nodes.

Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 356ea386 22-Mar-2022 andrew.yang <andrew.yang@mediatek.com>

mm/migrate: fix race between lock page and clear PG_Isolated

When memory is tight, system may start to compact memory for large
continuous memory demands. If one process tries to lock a memory page
that is being locked and isolated for compaction, it may wait a long time
or even forever. This is because compaction will perform non-atomic
PG_Isolated clear while holding page lock, this may overwrite PG_waiters
set by the process that can't obtain the page lock and add itself to the
waiting queue to wait for the lock to be unlocked.

CPU1 CPU2
lock_page(page); (successful)
lock_page(); (failed)
__ClearPageIsolated(page); SetPageWaiters(page) (may be overwritten)
unlock_page(page);

The solution is to not perform non-atomic operation on page flags while
holding page lock.

Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Vlastimil Babka" <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: "William Kucharski" <william.kucharski@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc89213a 22-Mar-2022 Huang Ying <ying.huang@intel.com>

mm,migrate: fix establishing demotion target

In commit ac16ec835314 ("mm: migrate: support multiple target nodes
demotion"), after the first demotion target node is found, we will
continue to check the next candidate obtained via find_next_best_node().
This is to find all demotion target nodes with same NUMA distance. But
one side effect of find_next_best_node() is that the candidate node
returned will be set in "used" parameter, even if the candidate node isn't
passed in the following NUMA distance checking, the candidate node will
not be used as demotion target node for the following nodes. For example,
for system as follows,

node distances:
node 0 1 2 3
0: 10 21 17 28
1: 21 10 28 17
2: 17 28 10 28
3: 28 17 28 10

when we establish demotion target node for node 0, in the first round node
2 is added to the demotion target node set. Then in the second round,
node 3 is checked and failed because distance(0, 3) > distance(0, 2). But
node 3 is set in "used" nodemask too. When we establish demotion target
node for node 1, there is no available node. This is wrong, node 3 should
be set as the demotion target of node 1.

To fix this, if the candidate node is failed to pass the distance
checking, it will be cleared in "used" nodemask. So that it can be used
for the following node.

The bug can be reproduced and fixed with this patch on a 2 socket server
machine with DRAM and PMEM.

Link: https://lkml.kernel.org/r/20220128055940.1792614-1-ying.huang@intel.com
Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b698f0a1 22-Mar-2022 Hugh Dickins <hughd@google.com>

mm/fs: delete PF_SWAPWRITE

PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE. But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87d2762e 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm: remove unneeded local variable follflags

We can pass FOLL_GET | FOLL_DUMP to follow_page directly to simplify the
code a bit in add_page_for_migration and split_huge_pages_pid.

Link: https://lkml.kernel.org/r/20220311072002.35575-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3150be8f 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

mm: replace multiple dcache flush with flush_dcache_folio()

Simplify the code by using flush_dcache_folio().

Link: https://lkml.kernel.org/r/20220210123058.79206-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2771739a 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

mm: fix missing cache flush for all tail pages of compound page

The D-cache maintenance inside move_to_new_page() only consider one
page, there is still D-cache maintenance issue for tail pages of
compound page (e.g. THP or HugeTLB).

THP migration is only enabled on x86_64, ARM64 and powerpc, while
powerpc and arm64 need to maintain the consistency between I-Cache and
D-Cache, which depends on flush_dcache_page() to maintain the
consistency between I-Cache and D-Cache.

But there is no issues on arm64 and powerpc since they already considers
the compound page cache flushing in their icache flush function.
HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
riscv, s390 and sh, while arm has handled the compound page cache flush
in flush_dcache_page(), but most others do not.

In theory, the issue exists on many architectures. Fix this by not
using flush_dcache_folio() since it is not backportable.

Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65462462 22-Mar-2022 John Hubbard <jhubbard@nvidia.com>

mm/gup: follow_pfn_pte(): -EEXIST cleanup

Remove a quirky special case from follow_pfn_pte(), and adjust its
callers to match. Caller changes include:

__get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and
its variants should handle PFN-only entries by stopping early, if the
caller expected **pages to be filled in. This makes for a more reliable
API, as compared to the previous approach of skipping over such entries
(and thus leaving them silently unwritten).

move_pages(): squash the -EEXIST error return from follow_page() into
-EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is
not.

Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Xu <peterx@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f031c6f 29-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/rmap: Convert rmap_walk() to take a folio

This ripples all the way through to every calling and called function
from rmap.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 4eecb8b9 28-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert remove_migration_ptes() to folios

Convert the implementation and all callers.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 4b8554c5 28-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/rmap: Convert try_to_migrate() to folios

Convert the callers to pass a folio and the try_to_migrate_one()
worker to use a folio throughout. Fixes an assumption that a
folio must be <= PMD size.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 2aff7a47 03-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Convert page_vma_mapped_walk to work on PFNs

page_mapped_in_vma() really just wants to walk one page, but as the
code stands, if passed the head page of a compound page, it will
walk every page in the compound page. Extract pfn/nr_pages/pgoff
from the struct page early, so they can be overridden by
page_mapped_in_vma().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# eed05e54 03-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK

Instead of declaring a struct page_vma_mapped_walk directly,
use these helpers to allow us to transition to a PFN approach in the
following patches.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 76cbbead 15-Feb-2022 Christoph Hellwig <hch@lst.de>

mm: move the migrate_vma_* device migration code into its own file

Split the code used to migrate to and from ZONE_DEVICE memory from
migrate.c into a new file.

Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# aaf7d70c 15-Feb-2022 Christoph Hellwig <hch@lst.de>

mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Link: https://lkml.kernel.org/r/20220210072828.2930359-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 1776c0d1 15-Feb-2022 Christoph Hellwig <hch@lst.de>

mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Link: https://lkml.kernel.org/r/20220210072828.2930359-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 27674ef6 15-Feb-2022 Christoph Hellwig <hch@lst.de>

mm: remove the extra ZONE_DEVICE struct page refcount

ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1. This is a separate issue and will
be sorted out later. Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.

Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# b7435507 14-Feb-2022 Hugh Dickins <hughd@google.com>

mm/munlock: page migration needs mlock pagevec drained

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# c3096e67 14-Feb-2022 Hugh Dickins <hughd@google.com>

mm/migrate: __unmap_and_move() push good newpage to LRU

Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).

That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called. Or
page_referenced_one() and try_to_unmap_one() might do that extra work.

But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU, and its mlock_count (if any) is built back up by
remove_migration_ptes().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# cea86fe2 14-Feb-2022 Hugh Dickins <hughd@google.com>

mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# ffa65753 21-Jan-2022 Alistair Popple <apopple@nvidia.com>

mm/migrate.c: rework migration_entry_wait() to not take a pageref

This fixes the FIXME in migrate_vma_check_page().

Before migrating a page migration code will take a reference and check
there are no unexpected page references, failing the migration if there
are. When a thread faults on a migration entry it will take a temporary
reference to the page to wait for the page to become unlocked signifying
the migration entry has been removed.

This reference is dropped just prior to waiting on the page lock,
however the extra reference can cause migration failures so it is
desirable to avoid taking it.

As migration code already has a reference to the migrating page an extra
reference to wait on PG_locked is unnecessary so long as the reference
can't be dropped whilst setting up the wait.

When faulting on a migration entry the ptl is taken to check the
migration entry. Removing a migration entry also requires the ptl, and
migration code won't drop its page reference until after the migration
entry has been removed. Therefore retaining the ptl of a migration
entry is sufficient to ensure the page has a reference. Reworking
migration_entry_wait() to hold the ptl until the wait setup is complete
means the extra page reference is no longer needed.

[apopple@nvidia.com: v5]
Link: https://lkml.kernel.org/r/20211213033848.1973946-1-apopple@nvidia.com

Link: https://lkml.kernel.org/r/20211118020754.954425-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1e8db04 14-Jan-2022 Colin Ian King <colin.king@intel.com>

mm/migrate: remove redundant variables used in a for-loop

The variable addr is being set and incremented in a for-loop but not
actually being used. It is redundant and so addr and also variable
start can be removed.

Link: https://lkml.kernel.org/r/20211221185729.609630-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dcee9bf5 14-Jan-2022 Huang Ying <ying.huang@intel.com>

mm/migrate: move node demotion code to near its user

Now, node_demotion and next_demotion_node() are placed between
__unmap_and_move() and unmap_and_move(). This hurts code readability.
So move them near their users in the file. There's no functionality
change in this patch.

Link: https://lkml.kernel.org/r/20211206031227.3323097-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7813a1b5 14-Jan-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: add more comments for selecting target node randomly

As Yang Shi suggested [1], it will be helpful to explain why we should
select target node randomly now if there are multiple target nodes.

[1] https://lore.kernel.org/all/CAHbLzkqSqCL+g7dfzeOw8fPyeEC0BBv13Ny1UVGHDkadnQdR=g@mail.gmail.com/

Link: https://lkml.kernel.org/r/c31d36bd097c6e9e69fc0f409c43b78e53e64fc2.1637766801.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac16ec83 14-Jan-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: support multiple target nodes demotion

We have some machines with multiple memory types like below, which have
one fast (DRAM) memory node and two slow (persistent memory) memory
nodes. According to current node demotion policy, if node 0 fills up,
its memory should be migrated to node 1, when node 1 fills up, its
memory will be migrated to node 2: node 0 -> node 1 -> node 2 ->stop.

But this is not efficient and suitbale memory migration route for our
machine with multiple slow memory nodes. Since the distance between
node 0 to node 1 and node 0 to node 2 is equal, and memory migration
between slow memory nodes will increase persistent memory bandwidth
greatly, which will hurt the whole system's performance.

Thus for this case, we can treat the slow memory node 1 and node 2 as a
whole slow memory region, and we should migrate memory from node 0 to
node 1 and node 2 if node 0 fills up.

This patch changes the node_demotion data structure to support multiple
target nodes, and establishes the migration path to support multiple
target nodes with validating if the node distance is the best or not.

available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 55135 MB
node 1 cpus:
node 1 size: 127007 MB
node 1 free: 126930 MB
node 2 cpus:
node 2 size: 126968 MB
node 2 free: 126878 MB
node distances:
node 0 1 2
0: 10 20 20
1: 20 10 20
2: 20 20 10

Link: https://lkml.kernel.org/r/00728da107789bb4ed9e0d28b1d08fd8056af2ef.1636697263.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d39a7eb 14-Jan-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: correct the hugetlb migration stats

Correct the migration stats for hugetlb with using compound_nr() instead
of thp_nr_pages(), meanwhile change 'nr_failed_pages' to record the
number of normal pages failed to migrate, including THP and hugetlb, and
'nr_succeeded' will record the number of normal pages migrated
successfully.

[baolin.wang@linux.alibaba.com: fix docs, per Mike]
Link: https://lkml.kernel.org/r/141bdfc6-f898-3cc3-f692-726c5f6cb74d@linux.alibaba.com

Link: https://lkml.kernel.org/r/71a4b6c22f208728fe8c78ad26375436c4ff9704.1636275127.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5bade97 14-Jan-2022 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: fix the return value of migrate_pages()

Patch series "Improve the migration stats".

According to talk with Zi Yan [1], this patch set changes the return
value of migrate_pages() to avoid returning a number which is larger
than the number of pages the users tried to migrate by move_pages()
syscall. Also fix the hugetlb migration stats and migration stats in
trace_mm_compaction_migratepages().

[1] https://lore.kernel.org/linux-mm/7E44019D-2A5D-4BA7-B4D5-00D4712F1687@nvidia.com/

This patch (of 3):

As Zi Yan pointed out, the syscall move_pages() can return a
non-migrated number larger than the number of pages the users tried to
migrate, when a THP page is failed to migrate. This is confusing for
users.

Since other migration scenarios do not care about the actual
non-migrated number of pages except the memory compaction migration
which will fix in following patch. Thus we can change the return value
to return the number of {normal page, THP, hugetlb} instead to avoid
this issue, and the number of THP splits will be considered as the
number of non-migrated THP, no matter how many subpages of the THP are
migrated successfully. Meanwhile we should still keep the migration
counters using the number of normal pages.

Link: https://lkml.kernel.org/r/cover.1636275127.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/6486fabc3e8c66ff613e150af25e89b3147977a6.1636275127.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Co-developed-by: Zi Yan <ziy@nvidia.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1eba86c0 14-Jan-2022 Pasha Tatashin <pasha.tatashin@soleen.com>

mm: change page type prior to adding page table entry

Patch series "page table check", v3.

Ensure that some memory corruptions are prevented by checking at the
time of insertion of entries into user page tables that there is no
illegal sharing.

We have recently found a problem [1] that existed in kernel since 4.14.
The problem was caused by broken page ref count and led to memory
leaking from one process into another. The problem was accidentally
detected by studying a dump of one process and noticing that one page
contains memory that should not belong to this process.

There are some other page->_refcount related problems that were recently
fixed: [2], [3] which potentially could also lead to illegal sharing.

In addition to hardening refcount [4] itself, this work is an attempt to
prevent this class of memory corruption issues.

It uses a simple state machine that is independent from regular MM logic
to check for illegal sharing at time pages are inserted and removed from
page tables.

[1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com
[2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com
[3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com
[4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com

This patch (of 4):

There are a few places where we first update the entry in the user page
table, and later change the struct page to indicate that this is
anonymous or file page.

In most places, however, we first configure the page metadata and then
insert entries into the page table. Page table check, will use the
information from struct page to verify the type of entry is inserted.

Change the order in all places to first update struct page, and later to
update page table.

This means that we first do calls that may change the type of page (anon
or file):

page_move_anon_rmap
page_add_anon_rmap
do_page_add_anon_rmap
page_add_new_anon_rmap
page_add_file_rmap
hugepage_add_anon_rmap
hugepage_add_new_anon_rmap

And after that do calls that add entries to the page table:

set_huge_pte_at
set_pte_at

Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b24ca4a 27-Jun-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Use multi-index entries in the page cache

We currently store large folios as 2^N consecutive entries. While this
consumes rather more memory than necessary, it also turns out to be buggy.
A writeback operation which starts within a tail page of a dirty folio will
not write back the folio as the xarray's dirty bit is only set on the
head index. With multi-index entries, the dirty bit will be found no
matter where in the folio the operation starts.

This does end up simplifying the page cache slightly, although not as
much as I had hoped.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 9f2b04a2 16-Aug-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

filemap: Add folio_put_wait_locked()

Convert all three callers of put_and_wait_on_page_locked() to
folio_put_wait_locked(). This shrinks the kernel overall by 19 bytes.
filemap_update_page() shrinks by 19 bytes while __migration_entry_wait()
is unchanged. folio_put_wait_locked() is 14 bytes smaller than
put_and_wait_on_page_locked(), but pmd_migration_entry_wait() grows by
14 bytes. It removes the assumption from pmd_migration_entry_wait()
that pages cannot be larger than a PMD (which is true today, but
may be interesting to explore in the future).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# ab09243a 10-Nov-2021 Alistair Popple <apopple@nvidia.com>

mm/migrate.c: remove MIGRATE_PFN_LOCKED

MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
source page was already locked during migrate_vma_collect(). If it
wasn't then the a second attempt is made to lock the page. However if
the first attempt failed it's unlikely a second attempt will succeed,
and the retry adds complexity. So clean this up by removing the retry
and MIGRATE_PFN_LOCKED flag.

Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
set, but nothing actually checks that.

Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ef02462 10-Nov-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: simplify the file-backed pages validation when migrating its mapping

There is no need to validate the file-backed page's refcount before
trying to freeze the page's expected refcount, instead we can rely on
the folio_ref_freeze() to validate if the page has the expected refcount
before migrating its mapping.

Moreover we are always under the page lock when migrating the page
mapping, which means nowhere else can remove it from the page cache, so
we can remove the xas_load() validation under the i_pages lock.

Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20f9ba4f 05-Nov-2021 Yang Shi <shy828301@gmail.com>

mm: migrate: make demotion knob depend on migration

The memory demotion needs to call migrate_pages() to do the jobs. And
it is controlled by a knob, however, the knob doesn't depend on
CONFIG_MIGRATION. The knob could be truned on even though MIGRATION is
disabled, this will not cause any crash since migrate_pages() would just
return -ENOSYS. But it is definitely not optimal to go through demotion
path then retry regular swap every time.

And it doesn't make too much sense to have the knob visible to the users
when !MIGRATION. Move the related code from mempolicy.[h|c] to
migrate.[h|c].

Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 715cbfd6 07-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Add folio_migrate_copy()

This is the folio equivalent of migrate_page_copy(), which is retained
as a wrapper for filesystems which are not yet converted to folios.
Also convert copy_huge_page() to folio_copy().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# 19138349 07-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Add folio_migrate_flags()

Turn migrate_page_states() into a wrapper around folio_migrate_flags().
Also convert two functions only called from folio_migrate_flags() to
be folio-based. ksm_migrate_page() becomes folio_migrate_ksm() and
copy_page_owner() becomes folio_copy_owner(). folio_migrate_flags()
alone shrinks by two thirds -- 1967 bytes down to 642 bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# 3417013e 07-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Add folio_migrate_mapping()

Reimplement migrate_page_move_mapping() as a wrapper around
folio_migrate_mapping(). Saves 193 bytes of kernel text.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# d21bba2b 06-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memcg: Convert mem_cgroup_migrate() to take folios

Convert all callers of mem_cgroup_migrate() to call page_folio() first.
They all look like they're using head pages already, but this proves it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# 8f425e4e 25-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memcg: Convert mem_cgroup_charge() to take a folio

Convert all callers of mem_cgroup_charge() to call page_folio() on the
page they're currently passing in. Many of them will be converted to
use folios themselves soon.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# a6a0251c 18-Oct-2021 Huang Ying <ying.huang@intel.com>

mm/migrate: fix CPUHP state to update node demotion order

The node demotion order needs to be updated during CPU hotplug. Because
whether a NUMA node has CPU may influence the demotion order. The
update function should be called during CPU online/offline after the
node_states[N_CPU] has been updated. That is done in
CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
CPU offline. But in commit 884a6e5d1f93 ("mm/migrate: update node
demotion order on hotplug events"), the function to update node demotion
order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline. This
doesn't satisfy the order requirement.

For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
and P2, P3 in S1), the demotion order is

- S0 -> NUMA_NO_NODE
- S1 -> NUMA_NO_NODE

After P2 and P3 is offlined, because S1 has no CPU now, the demotion
order should have been changed to

- S0 -> S1
- S1 -> NO_NODE

but it isn't changed, because the order updating callback for CPU
hotplug doesn't see the new nodemask. After that, if P1 is offlined,
the demotion order is changed to the expected order as above.

So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
update function on them.

Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76af6a05 18-Oct-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/migrate: add CPU hotplug to demotion #ifdef

Once upon a time, the node demotion updates were driven solely by memory
hotplug events. But now, there are handlers for both CPU and memory
hotplug.

However, the #ifdef around the code checks only memory hotplug. A
system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
hotplug events.

Update the #ifdef around the common code. Add memory and CPU-specific
#ifdefs for their handlers. These memory/CPU #ifdefs avoid unused
function warnings when their Kconfig option is off.

[arnd@arndb.de: rework hotplug_memory_notifier() stub]
Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org

Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 295be91f 18-Oct-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/migrate: optimize hotplug-time demotion order updates

Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.

This contains two fixes for the "automatic demotion" code which was
merged into 5.15:

* Fix memory hotplug performance regression by watching
suppressing any real action on irrelevant hotplug events.

* Ensure CPU hotplug handler is registered when memory hotplug
is disabled.

This patch (of 2):

== tl;dr ==

Automatic demotion opted for a simple, lazy approach to handling hotplug
events. This noticeably slows down memory hotplug[1]. Optimize away
updates to the demotion order when memory hotplug events should have no
effect.

This has no effect on CPU hotplug. There is no known problem on the CPU
side and any work there will be in a separate series.

== Background ==

Automatic demotion is a memory migration strategy to ensure that new
allocations have room in faster memory tiers on tiered memory systems.
The kernel maintains an array (node_demotion[]) to drive these
migrations.

The node_demotion[] path is calculated by starting at nodes with CPUs
and then "walking" to nodes with memory. Only hotplug events which
online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
actually affect the migration order.

== Problem ==

However, the current code is lazy. It completely regenerates the
migration order on *any* CPU or memory hotplug event. The logic was
that these events are extremely rare and that the overhead from
indiscriminate order regeneration is minimal.

Part of the update logic involves a synchronize_rcu(), which is a pretty
big hammer. Its overhead was large enough to be detected by some 0day
tests that watch memory hotplug performance[1].

== Solution ==

Add a new helper (node_demotion_topo_changed()) which can differentiate
between superfluous and impactful hotplug events. Skip the expensive
update operation for superfluous events.

== Aside: Locking ==

It took me a few moments to declare the locking to be safe enough for
node_demotion_topo_changed() to work. It all hinges on the memory
hotplug lock:

During memory hotplug events, 'mem_hotplug_lock' is held for write.
This ensures that two memory hotplug events can not be called
simultaneously.

CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
mutual exclusion between CPU hotplug events. In addition, the demotion
code acquire and hold the mem_hotplug_lock for read during its CPU
hotplug handlers. This provides mutual exclusion between the demotion
memory hotplug callbacks and the CPU hotplug callbacks.

This effectively allows treating the migration target generation code to
act as if it is single-threaded.

1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/

Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59ab844e 08-Sep-2021 Arnd Bergmann <arnd@arndb.de>

compat: remove some compat entry points

These are all handled correctly when calling the native system call entry
point, so remove the special cases.

Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b1b561b 08-Sep-2021 Arnd Bergmann <arnd@arndb.de>

mm: simplify compat_sys_move_pages

The compat move_pages() implementation uses compat_alloc_user_space() for
converting the pointer array. Moving the compat handling into the
function itself is a bit simpler and lets us avoid the
compat_alloc_user_space() call.

Link: https://lkml.kernel.org/r/20210727144859.4150043-4-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 213ecb31 08-Sep-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: change to use bool type for 'page_was_mapped'

Change to use bool type for 'page_was_mapped' variable making it more
readable.

Link: https://lkml.kernel.org/r/ce1279df18d2c163998c403e0b5ec6d3f6f90f7a.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68a9843f 08-Sep-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: fix the incorrect function name in comments

since commit a98a2f0c8ce1 ("mm/rmap: split migration into its own
function"), the migration ptes establishment has been split into a
separate try_to_migrate() function, thus update the related comments.

Link: https://lkml.kernel.org/r/5b824bad6183259c916ae6cf42f81d14c6118b06.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b9b624f 08-Sep-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: migrate: introduce a local variable to get the number of pages

Use thp_nr_pages() instead of compound_nr() to get the number of pages for
THP page, meanwhile introducing a local variable 'nr_pages' to avoid
getting the number of pages repeatedly.

Link: https://lkml.kernel.org/r/a8e331ac04392ee230c79186330fb05e86a2aa77.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9bd7d18 02-Sep-2021 Randy Dunlap <rdunlap@infradead.org>

mm/migrate: correct kernel-doc notation

Use the expected "Return:" format to prevent a kernel-doc warning.

mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node'

Link: https://lkml.kernel.org/r/20210808203151.10632-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ac95884 02-Sep-2021 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate: enable returning precise migrate_pages() success count

Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.

Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 884a6e5d 02-Sep-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/migrate: update node demotion order on hotplug events

Reclaim-based migration is attempting to optimize data placement in memory
based on the system topology. If the system changes, so must the
migration ordering.

The implementation is conceptually simple and entirely unoptimized. On
any memory or CPU hotplug events, assume that a node was added or removed
and recalculate all migration targets. This ensures that the
node_demotion[] array is always ready to be used in case the new reclaim
mode is enabled.

This recalculation is far from optimal, most glaringly that it does not
even attempt to figure out the hotplug event would have some *actual*
effect on the demotion order. But, given the expected paucity of hotplug
events, this should be fine.

Link: https://lkml.kernel.org/r/20210721063926.3024591-2-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-3-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79c28a41 02-Sep-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/numa: automatically generate node migration order

Patch series "Migrate Pages in lieu of discard", v11.

We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties. First, the newer allocations can end up in
the slower persistent memory. Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.

This patchset implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.

While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.

This is not perfect. It "strands" pages in slower memory and never brings
them back to fast DRAM. Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sit in the DRAM node. So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.

We have tested the patchset with the postgresql and pgbench. On a
2-socket server machine with DRAM and PMEM, the kernel with the patchset
can improve the score of pgbench up to 22.1% compared with that of the
DRAM only + disk case. This comes from the reduced disk read throughput
(which reduces up to 70.8%).

== Open Issues ==

* Memory policies and cpusets that, for instance, restrict allocations
to DRAM can be demoted to PMEM whenever they opt in to this
new mechanism. A cgroup-level API to opt-in or opt-out of
these migrations will likely be required as a follow-on.
* Could be more aggressive about where anon LRU scanning occurs
since it no longer necessarily involves I/O. get_scan_count()
for instance says: "If we have no swap space, do not bother
scanning anon pages"

This patch (of 9):

Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table. This allows creating single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages. A node with no target is a "terminal
node", so reclaim acts normally there. The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.

When memory fills up on a node, memory contents can be automatically
migrated to another node. The biggest problems are knowing when to
migrate and to where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists. Those lists already tell us if
memory is full where to look next. It would also be logical to move
memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists. This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate
node migration ordering. But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

RCU is used to allow lock-less read of node_demotion[] and prevent
demotion cycles been observed. If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed. Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

terminal_node = start_node;
while (node_demotion[terminal_node] != NUMA_NO_NODE) {
terminal_node = node_demotion[terminal_node];
}

The initial values are:

node_demotion[0] = 1;
node_demotion[1] = NUMA_NO_NODE;

and are updated to:

node_demotion[0] = NUMA_NO_NODE;
node_demotion[1] = 0;

What guarantees that the cycle is not observed:

node_demotion[0] = 1;
node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.

RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[]. This ensures that any updates are *eventually*
observed.

Consider the above loop again. The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

This code is unused for now. It will be called later in the
series.

Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5916c02 29-Jul-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/migrate: fix NR_ISOLATED corruption on 64-bit

Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages. With
unsigned int type the large unsigned int converts to a large positive
signed long.

Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79789db0 12-Jul-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Make copy_huge_page() always available

Rewrite copy_huge_page() and move it into mm/util.c so it's always
available. Fixes an exposure of uninitialised memory on configurations
with HUGETLB and UFFD enabled and MIGRATION disabled.

Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b49bf6d 30-Jun-2021 Alistair Popple <apopple@nvidia.com>

mm: rename migrate_pgmap_owner

MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.

Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a98a2f0c 30-Jun-2021 Alistair Popple <apopple@nvidia.com>

mm/rmap: split migration into its own function

Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.

Several simplifications can also be made in try_to_migrate_one() based on
the following observations:

- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().

Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4dd845b5 30-Jun-2021 Alistair Popple <apopple@nvidia.com>

mm/swapops: rework swap entry manipulation code

Both migration and device private pages use special swap entries that are
manipluated by a range of inline functions. The arguments to these are
somewhat inconsistent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.

Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af5cdaf8 30-Jun-2021 Alistair Popple <apopple@nvidia.com>

mm: remove special swap entry functions

Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 662aeea7 30-Jun-2021 Yang Shi <shy828301@gmail.com>

mm: migrate: check mapcount for THP instead of refcount

The generic migration path will check refcount, so no need check refcount
here. But the old code actually prevents from migrating shared THP
(mapped by multiple processes), so bail out early if mapcount is > 1 to
keep the behavior.

Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b0b515bf 30-Jun-2021 Yang Shi <shy828301@gmail.com>

mm: migrate: don't split THP for misplaced NUMA page

The old behavior didn't split THP if migration is failed due to lack of
memory on the target node. But the THP migration does split THP, so keep
the old behavior for misplaced NUMA page migration.

Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5fc5c3a 30-Jun-2021 Yang Shi <shy828301@gmail.com>

mm: migrate: account THP NUMA migration counters correctly

Now both base page and THP NUMA migration is done via
migrate_misplaced_page(), keep the counters correctly for THP.

Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5b5a3dd 30-Jun-2021 Yang Shi <shy828301@gmail.com>

mm: thp: refactor NUMA fault handling

When the THP NUMA fault support was added THP migration was not supported
yet. So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code.

This patch reworks the NUMA fault handling to use generic migration
implementation to migrate misplaced page. There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
Acquire ptl
Prepare for migration (elevate page refcount)
Release ptl
Isolate page from lru and elevate page refcount
Migrate the misplaced THP

If migration fails just restore the old normal PMD.

In the old code anon_vma lock was needed to serialize THP migration
against THP split, but since then the THP code has been reworked a lot, it
seems anon_vma lock is not required anymore to avoid the race.

The page refcount elevation when holding ptl should prevent from THP
split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
and remove all the dead and duplicate code.

[dan.carpenter@oracle.com: fix a double unlock bug]
Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda

Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6acfb5ba 30-Jun-2021 Muchun Song <songmuchun@bytedance.com>

mm: migrate: fix missing update page_private to hugetlb_page_subpool

Since commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
page flags") converts page.private for hugetlb specific page flags. We
should use hugetlb_page_subpool() to get the subpool pointer instead of
page_private().

This 'could' prevent the migration of hugetlb pages. page_private(hpage)
is now used for hugetlb page specific flags. At migration time, the only
flag which could be set is HPageVmemmapOptimized. This flag will only be
set if the new vmemmap reduction feature is enabled. In addition,
!page_mapping() implies an anonymous mapping. So, this will prevent
migration of hugetb pages in anonymous mappings if the vmemmap reduction
feature is enabled.

In addition, that if statement checked for the rare race condition of a
page being migrated while in the process of being freed. Since that check
is now wrong, we could leak hugetlb subpool usage counts.

The commit forgot to update it in the page migration routine. So fix it.

[songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com

Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> [arm64]
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8cc5fcbb 30-Jun-2021 Mina Almasry <almasrymina@google.com>

mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY

On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller after
allocating a page that consumed a reservation. In this case there may be
a fault that double consumes the reservation. To handle this, we free the
allocated page, fix the reservations, and allocate a temporary hugetlb
page and return that to the caller. When the caller does the copy outside
of the lock, we again check the cache, and allocate a page consuming the
reservation, and copy over the contents.

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the
test runs number of free/resv hugepages is correct.

[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
[almasrymina@google.com: fix allocation error check and copy func name]
Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com

Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79c1c594 30-Jun-2021 Christophe Leroy <christophe.leroy@csgroup.eu>

mm/hugetlb: change parameters of arch_make_huge_pte()

Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.

This series implements huge VMAP and VMALLOC on powerpc 8xx.

Powerpc 8xx has 4 page sizes:
- 4k
- 16k
- 512k
- 8M

At the time being, vmalloc and vmap only support huge pages which are
leaf at PMD level.

Here the PMD level is 4M, it doesn't correspond to any supported
page size.

For now, implement use of 16k and 512k pages which is done
at PTE level.

Support of 8M pages will be implemented later, it requires use of
hugepd tables.

To allow this, the architecture provides two functions:
- arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
page size to use. A stub returning PAGE_SIZE is provided when the
architecture doesn't provide this function.
- arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
what page shift to use for a given area size. A stub returning
PAGE_SHIFT is provided when the architecture doesn't provide this
function.

This patch (of 5):

At the time being, arch_make_huge_pte() has the following prototype:

pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
struct page *page, int writable);

vma is used to get the pages shift or size.
vma is also used on Sparc to get vm_flags.
page is not used.
writable is not used.

In order to use this function without a vma, replace vma by shift and
flags. Also remove the used parameters.

Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad2fa371 30-Jun-2021 Muchun Song <songmuchun@bytedance.com>

mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it. However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure. In
this case, we just refuse to free the HugeTLB page. This changes behavior
in some corner cases as listed below:

1) Failing to free a huge page triggered by the user (decrease nr_pages).

User needs to try again later.

2) Failing to free a surplus huge page when freed by the application.

Try again later when freeing a huge page next time.

3) Failing to dissolve a free huge page on ZONE_MOVABLE via
offline_pages().

This can happen when we have plenty of ZONE_MOVABLE memory, but
not enough kernel memory to allocate vmemmmap pages. We may even
be able to migrate huge page contents, but will not be able to
dissolve the source huge page. This will prevent an offline
operation and is unfortunate as memory offlining is expected to
succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the
memory savings gained from this feature are worth the risk of
possibly not being able to offline memory in certain situations.

4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
alloc_contig_range() - once we have that handling in place. Mainly
affects CMA and virtio-mem.

Similar to 3). virito-mem will handle migration errors gracefully.
CMA might be able to fallback on other free areas within the CMA
region.

Vmemmap pages are allocated from the page freeing context. In order for
those allocations to be not disruptive (e.g. trigger oom killer)
__GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure. GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.

[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 059b8b48 28-Jun-2021 Liam Howlett <liam.howlett@oracle.com>

mm/migrate: use vma_lookup() in do_pages_stat_array()

Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Link: https://lkml.kernel.org/r/20210521174745.2219620-20-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ffc90cbb 15-Jun-2021 Xu Yu <xuyu@linux.alibaba.com>

mm, thp: use head page in __migration_entry_wait()

We notice that hung task happens in a corner but practical scenario when
CONFIG_PREEMPT_NONE is enabled, as follows.

Process 0 Process 1 Process 2..Inf
split_huge_page_to_list
unmap_page
split_huge_pmd_address
__migration_entry_wait(head)
__migration_entry_wait(tail)
remap_page (roll back)
remove_migration_ptes
rmap_walk_anon
cond_resched

Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
copy_to_user in fstat, which will immediately fault again without
rescheduling, and thus occupy the cpu fully.

When there are too many processes performing __migration_entry_wait on
tail page, remap_page will never be done after cond_resched.

This makes __migration_entry_wait operate on the compound head page,
thus waits for remap_page to complete, whether the THP is split
successfully or roll back.

Note that put_and_wait_on_page_locked helps to drop the page reference
acquired with get_page_unless_zero, as soon as the page is on the wait
queue, before actually waiting. So splitting the THP is only prevented
for a brief interval.

Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0953a1b 06-May-2021 Ingo Molnar <mingo@kernel.org>

mm: fix typos in comments

Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7bc1aec5 04-May-2021 Liam Mark <lmark@codeaurora.org>

mm: cma: add trace events for CMA alloc perf testing

Add cma and migrate trace events to enable CMA allocation performance to
be measured via ftrace.

[georgi.djakov@linaro.org: add the CMA instance name to the cma_alloc_start trace event]
Link: https://lkml.kernel.org/r/20210326155414.25006-1-georgi.djakov@linaro.org

Link: https://lkml.kernel.org/r/20210324160740.15901-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ee820ee 04-May-2021 Miaohe Lin <linmiaohe@huawei.com>

Revert "mm: migrate: skip shared exec THP for NUMA balancing"

This reverts commit c77c5cbafe549eb330e8909861a3e16cbda2c848.

Since commit c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA
balancing"), the NUMA balancing would skip shared exec transhuge page.
But this enhancement is not suitable for transhuge page. Because it's
required that page_mapcount() must be 1 due to no migration pte dance is
done here. On the other hand, the shared exec transhuge page will leave
the migrate_misplaced_page() with pte entry untouched and page locked.
Thus pagefault for NUMA will be triggered again and deadlock occurs when
we start waiting for the page lock held by ourselves.

Yang Shi said:

"Thanks for catching this. By relooking the code I think the other
important reason for removing this is
migrate_misplaced_transhuge_page() actually can't see shared exec
file THP at all since page_lock_anon_vma_read() is called before
and if page is not anonymous page it will just restore the PMD
without migrating anything.
The pages for private mapped file vma may be anonymous pages due to
COW but they can't be THP so it won't trigger THP numa fault at all. I
think this is why no bug was reported. I overlooked this in the first
place."

Link: https://lkml.kernel.org/r/20210325131524.48181-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 843e1be1 04-May-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/migrate.c: use helper migrate_vma_collect_skip() in migrate_vma_collect_hole()

It's more recommended to use helper function migrate_vma_collect_skip() to
skip the unexpected case and it also helps remove some duplicated codes.
Move migrate_vma_collect_skip() above migrate_vma_collect_hole() to avoid
compiler warning.

Link: https://lkml.kernel.org/r/20210325131524.48181-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34f5e9b9 04-May-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()

If the zone device page does not belong to un-addressable device memory,
the variable entry will be uninitialized and lead to indeterminate pte
entry ultimately. Fix this unexpected case and warn about it.

Link: https://lkml.kernel.org/r/20210325131524.48181-4-linmiaohe@huawei.com
Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a04840c6 04-May-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case

It's guaranteed that in the 'else' case of the rc == MIGRATEPAGE_SUCCESS
check, rc does not equal to MIGRATEPAGE_SUCCESS. Remove this unnecessary
check.

Link: https://lkml.kernel.org/r/20210325131524.48181-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 606a6f71 04-May-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/migrate.c: make putback_movable_page() static

Patch series "Cleanup and fixup for mm/migrate.c", v3.

This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE and rc
!= MIGRATEPAGE_SUCCESS check. Also use helper function to remove some
duplicated codes. What's more, this fixes potential deadlock in NUMA
balancing shared exec THP case and so on. More details can be found in
the respective changelogs.

This patch (of 5):

The putback_movable_page() is just called by putback_movable_pages() and
we know the page is locked and both PageMovable() and PageIsolated() is
checked right before calling putback_movable_page(). So we make it static
and remove all the 3 VM_BUG_ON_PAGE().

Link: https://lkml.kernel.org/r/20210325131524.48181-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210325131524.48181-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 361a2a22 04-May-2021 Minchan Kim <minchan@kernel.org>

mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]

Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable]. There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

note: migrate_prep_local in compaction.c changed into lru_add_drain to
avoid CPU schedule cost with involving many other CPUs to keep old
behavior.

Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d479960e 04-May-2021 Minchan Kim <minchan@kernel.org>

mm: disable LRU pagevec during the migration temporarily

LRU pagevec holds refcount of pages until the pagevec are drained. It
could prevent migration since the refcount of the page is greater than
the expection in migration logic. To mitigate the issue, callers of
migrate_pages drains LRU pagevec via migrate_prep or lru_add_drain_all
before migrate_pages call.

However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.

To close the race, this patch disables lru caches(i.e, pagevec) during
ongoing migration until migrate is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a sync
migration) with below debug code.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR, "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}

The test was repeating android apps launching with cma allocation in
background every five seconds. Total cma allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.

The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.

Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84172f4b 30-Apr-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask

There are only two callers of __alloc_pages() so prune the thicket of
alloc_page variants by combining the two functions together. Current
callers of __alloc_pages() simply add an extra 'NULL' parameter and
current callers of __alloc_pages_nodemask() call __alloc_pages() instead.

Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b6038942 24-Feb-2021 Shakeel Butt <shakeelb@google.com>

mm: memcg: add swapcache stat for memcg v2

This patch adds swapcache stat for the cgroup v2. The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup. The main motivation behind exposing the
swapcache stat is for enabling users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.

Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
workload but without control on the exact proportion of memory and swap.
Cgroup v2 provides separate limits for memory and swap which enables more
control on the exact usage of memory and swap individually for the
workload.

With some little subtleties, the v1's memsw limit can be switched with the
sum of the v2's memory and swap limits. However the alternative for memsw
usage is not yet available in cgroup v2. Exposing per-cgroup swapcache
stat enables that alternative. Adding the memory usage and swap usage and
subtracting the swapcache will approximate the memsw usage. This will
help in the transparent migration of the workloads depending on memsw
usage and limit to v2' memory and swap counters.

The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics. A single usage metric is more
simple to use and reason about for them.

(2) The memsw usage metric hides the underlying system's swap setup from
the applications. Applications with multiple instances running in a
datacenter with heterogeneous systems (some have swap and some don't) will
keep seeing a consistent view of their usage.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]

Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48054625 24-Feb-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/filemap: pass a sleep state to put_and_wait_on_page_locked

This is prep work for the next patch, but I think at least one of the
current callers would prefer a killable sleep to an uninterruptible one.

Link: https://lkml.kernel.org/r/20210122160140.223228-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71a64f61 04-Feb-2021 Muchun Song <songmuchun@bytedance.com>

mm: migrate: do not migrate HugeTLB page whose refcount is one

All pages isolated for the migration have an elevated reference count and
therefore seeing a reference count equal to 1 means that the last user of
the page has dropped the reference and the page has became unused and
there doesn't make much sense to migrate it anymore.

This has been done for regular pages and this patch does the same for
hugetlb pages. Although the likelihood of the race is rather small for
hugetlb pages it makes sense the two code paths in sync.

Link: https://lkml.kernel.org/r/20210115124942.46403-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c447d27 23-Jan-2021 Shakeel Butt <shakeelb@google.com>

mm: fix numa stats for thp migration

Currently the kernel is not correctly updating the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that.

For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, although at the moment
there is no need to handle THP migration as kernel still does not have
write support for file THP but to be more future proof, this patch adds
the THP support for those stats as well.

Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a8792f6 23-Jan-2021 Shakeel Butt <shakeelb@google.com>

mm: memcg: fix memcg file_dirty numa stat

The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.

That was not an issue until recently the commit 5f9a4f4a7096 ("mm:
memcontrol: add the missing numa_stat interface for cgroup v2") exposed
numa stats for the memcg.

So fix the file_dirty per-memcg numa stat.

Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8958b249 15-Dec-2020 Haitao Shi <shihaitao1@huawei.com>

mm: fix some spelling mistakes in comments

Fix some spelling mistakes in comments:
udpate ==> update
succesful ==> successful
exmaple ==> example
unneccessary ==> unnecessary
stoping ==> stopping
uknown ==> unknown

Link: https://lkml.kernel.org/r/20201127011747.86005-1-shihaitao1@huawei.com
Signed-off-by: Haitao Shi <shihaitao1@huawei.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d85c6db4 14-Dec-2020 Stephen Zhang <starzhangzsd@gmail.com>

mm: migrate: remove unused parameter in migrate_vma_insert_page()

"dst" parameter to migrate_vma_insert_page() is not used anymore.

Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=tHLDc95kQkqU-rEnAg@mail.gmail.com
Signed-off-by: Stephen Zhang <starzhangzsd@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d532e2e5 14-Dec-2020 Yang Shi <shy828301@gmail.com>

mm: migrate: return -ENOSYS if THP migration is unsupported

In the current implementation unmap_and_move() would return -ENOMEM if THP
migration is unsupported, then the THP will be split. If split is failed
just exit without trying to migrate other pages. It doesn't make too much
sense since there may be enough free memory to migrate other pages and
there may be a lot base pages on the list.

Return -ENOSYS to make consistent with hugetlb. And if THP split is
failed just skip and try other pages on the list.

Just skip the whole list and exit when free memory is really low.

Link: https://lkml.kernel.org/r/20201113205359.556831-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 236c32eb 14-Dec-2020 Yang Shi <shy828301@gmail.com>

mm: migrate: clean up migrate_prep{_local}

The migrate_prep{_local} never fails, so it is pointless to have return
value and check the return value.

Link: https://lkml.kernel.org/r/20201113205359.556831-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c77c5cba 14-Dec-2020 Yang Shi <shy828301@gmail.com>

mm: migrate: skip shared exec THP for NUMA balancing

The NUMA balancing skip shared exec base page. Since
CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec
THP, so skip such THPs for NUMA balancing as well.

And Willy's regular filesystem THP support patches could create shared
exec THP wven without that config.

In addition, the page_is_file_lru() is used to tell if the page is file
cache or not, but it filters out shmem page. It sounds like a typical
usecase by putting executables in shmem to achieve performance gain via
using shmem-THP, so it sounds worth skipping migration for such case too.

Link: https://lkml.kernel.org/r/20201113205359.556831-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd4ae78a 14-Dec-2020 Yang Shi <shy828301@gmail.com>

mm: migrate: simplify the logic for handling permanent failure

When unmap_and_move{_huge_page}() returns !-EAGAIN and
!MIGRATEPAGE_SUCCESS, the page would be put back to LRU or proper list if
it is non-LRU movable page. But, the callers always call
putback_movable_pages() to put the failed pages back later on, so it seems
not very efficient to put every single page back immediately, and the code
looks convoluted.

Put the failed page on a separate list, then splice the list to migrate
list when all pages are tried. It is the caller's responsibility to call
putback_movable_pages() to handle failures. This also makes the code
simpler and more readable.

After the change the rules are:
* Success: non hugetlb page will be freed, hugetlb page will be put
back
* -EAGAIN: stay on the from list
* -ENOMEM: stay on the from list
* Other errno: put on ret_pages list then splice to from list

The from list would be empty iff all pages are migrated successfully, it
was not so before. This has no impact to current existing callsites.

Link: https://lkml.kernel.org/r/20201113205359.556831-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d12b8951 14-Dec-2020 Yang Shi <shy828301@gmail.com>

mm: truncate_complete_page() does not exist any more

Patch series "mm: misc migrate cleanup and improvement", v3.

This patch (of 5):

The commit 9f4e41f4717832e ("mm: refactor truncate_complete_page()")
refactored truncate_complete_page(), and it is not existed anymore,
correct the comment in vmscan and migrate to avoid confusion.

Link: https://lkml.kernel.org/r/20201113205359.556831-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20201113205359.556831-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e5dda81 14-Dec-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate.c: optimize migrate_vma_pages() mmu notifier

When migrating a zero page or pte_none() anonymous page to device private
memory, migrate_vma_setup() will initialize the src[] array with a NULL
PFN. This lets the device driver allocate device private memory and clear
it instead of DMAing a page of zeros over the device bus.

Since the source page didn't exist at the time, no struct page was locked
nor a migration PTE inserted into the CPU page tables. The actual PTE
insertion happens in migrate_vma_pages() when it tries to insert the
device private struct page PTE into the CPU page tables.
migrate_vma_pages() has to call the mmu notifiers again since another
device could fault on the same page before the page table locks are
acquired.

Allow device drivers to optimize the invalidation similar to
migrate_vma_setup() by calling mmu_notifier_range_init() which sets struct
mmu_notifier_range event type to MMU_NOTIFY_MIGRATE and the
migrate_pgmap_owner field.

Link: https://lkml.kernel.org/r/20201021191335.10916-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab9dd4f8 14-Dec-2020 Long Li <lonuxli.64@gmail.com>

mm/migrate.c: fix comment spelling

The word in the comment is misspelled, it should be "include".

Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong
Signed-off-by: Long Li <lonuxli.64@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 013339df 14-Dec-2020 Shakeel Butt <shakeelb@google.com>

mm/rmap: always do TTU_IGNORE_ACCESS

Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check. More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page. So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg
reclaim. From memcg reclaim the page_referenced() only account the
accesses from the processes which are in the same memcg of the target page
but the unmapping code is considering accesses from all the processes, so,
decreasing the effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 336bf30e 13-Nov-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: fix anon huge page migration race

Qian Cai reported the following BUG in [1]

LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.

This patch does the following:

- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.

- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:

- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).

- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.

Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4dc200ce 17-Oct-2020 Miaohe Lin <linmiaohe@huawei.com>

mm/migrate: avoid possible unnecessary process right check in kernel_move_pages()

There is no need to check if this process has the right to modify the
specified process when they are same. And we could also skip the security
hook call if a process is modifying its own pages. Add helper function to
handle these.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christopher Lameter <cl@linux.com>
Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79f5f8fa 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: rework soft offline for in-use pages

This patch changes the way we set and handle in-use poisoned pages. Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safe net and
would skip that page.

This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be in a buddy freelist.

Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)

* Normal pages (order-0 and anon-THP)

- If they are clean and unmapped page cache pages, we invalidate
then by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, do not call put_page directly from those paths. Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison page in
free_pages_prepare, that just skips any poisoned page, so those pages
do not end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

- wait until the page hits buddy's freelists
- take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has been already allocated so we cannot soft offline it.
But the user could always retry it.

* Hugetlb pages

- We isolate-and-migrate them

After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided to not make the code overly complicated and just fail
normally if the page we allocated in the meantime.

We can always build on top of this.

As a bonus, because of the way we handle now in-use pages, we no longer
need the put-as-isolation-migratetype dance, that was guarding for poisoned
pages to end up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1f4f3ab 13-Oct-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: remove obsolete comment about device public

Device public memory never had an in tree consumer and was removed in
commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support"). Delete
the obsolete comment.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20200827190735.12752-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42578891 13-Oct-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: remove cpages-- in migrate_vma_finalize()

The variable struct migrate_vma->cpages is only used in
migrate_vma_setup(). There is no need to decrement it in
migrate_vma_finalize() since it is never checked.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20200827190735.12752-1-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c5c7b9f 25-Sep-2020 Zi Yan <ziy@nvidia.com>

mm/migrate: correct thp migration stats

PageTransHuge returns true for both thp and hugetlb, so thp stats was
counting both thp and hugetlb migrations. Exclude hugetlb migration by
setting is_thp variable right.

Clean up thp handling code too when we are there.

Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f56753ac 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag

Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities. Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a333e3e7 18-Sep-2020 Hugh Dickins <hughd@google.com>

mm: migration of hugetlbfs page skip memcg

hugetlbfs pages do not participate in memcg: so although they do find most
of migrate_page_states() useful, it would be better if they did not call
into mem_cgroup_migrate() - where Qian Cai reported that LTP's
move_pages12 triggers the warning in Alex Shi's prospective commit
"mm/memcg: warning on !memcg after readahead page charged".

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxch.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d321bf8 04-Sep-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: preserve soft dirty in remove_migration_pte()

The code to remove a migration PTE and replace it with a device private
PTE was not copying the soft dirty bit from the migration entry. This
could lead to page contents not being marked dirty when faulting the page
back from device private memory.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6128763f 04-Sep-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: remove unnecessary is_zone_device_page() check

Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".

I happened to notice this from code inspection after seeing Alistair
Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").

This patch (of 2):

The check for is_zone_device_page() and is_device_private_page() is
unnecessary since the latter is sufficient to determine if the page is a
device private page. Simplify the code for easier reading.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad7df764 04-Sep-2020 Alistair Popple <alistair@popple.id.au>

mm/rmap: fixup copying of soft dirty and uffd ptes

During memory migration a pte is temporarily replaced with a migration
swap pte. Some pte bits from the existing mapping such as the soft-dirty
and uffd write-protect bits are preserved by copying these to the
temporary migration swap pte.

However these bits are not stored at the same location for swap and
non-swap ptes. Therefore testing these bits requires using the
appropriate helper function for the given pte type.

Unfortunately several code locations were found where the wrong helper
function is being used to test soft_dirty and uffd_wp bits which leads to
them getting incorrectly set or cleared during page-migration.

Fix these by using the correct tests based on pte type.

Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebdf8321 04-Sep-2020 Alistair Popple <alistair@popple.id.au>

mm/migrate: fixup setting UFFD_WP flag

Commit f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
introduced support for tracking the uffd wp bit during page migration.
However the non-swap PTE variant was used to set the flag for zone device
private pages which are a type of swap page.

This leads to corruption of the swap offset if the original PTE has the
uffd_wp flag set.

Fixes: f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c357848 14-Aug-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: replace hpage_nr_pages with thp_nr_pages

The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0976311 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/mempolicy: use a standard migration target allocation callback

There is a well-defined migration target allocation callback. Use it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19fc7bed 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: introduce a standard migration target allocation function

There are some similar functions for migration target allocation. Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants. This patch implements base migration target
allocation function. In the following patches, variants will be converted
to use this function.

Changes should be mechanical, but, unfortunately, there are some
differences. First, some callers' nodemask is assgined to NULL since NULL
nodemask will be considered as all available nodes, that is,
&node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
user provided gfp_mask has it. This is because future caller of this
function requires to set this node constaint. Lastly, if provided nodeid
is NUMA_NO_NODE, nodeid is set up to the node where migration source
lives. It helps to remove simple wrappers for setting up the nodeid.

Note that PageHighmem() call in previous function is changed to open-code
"is_highmem_idx()" since it provides more readability.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9933a0c8 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: clear __GFP_RECLAIM to make the migration callback consistent with regular THP allocations

new_page_nodemask is a migration callback and it tries to use a common gfp
flags for the target page allocation whether it is a base page or a THP.
The later only adds GFP_TRANSHUGE to the given mask. This results in the
allocation being slightly more aggressive than necessary because the
resulting gfp mask will contain also __GFP_RECLAIM_KSWAPD. THP
allocations usually exclude this flag to reduce over eager background
reclaim during a high THP allocation load which has been seen during large
mmaps initialization. There is no indication that this is a problem for
migration as well but theoretically the same might happen when migrating
large mappings to a different node. Make the migration callback
consistent with regular THP allocations.

[akpm@linux-foundation.org: fix comment typo, per Vlastimil]

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d92bbc27 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/hugetlb: unify migration callbacks

There is no difference between two migration callback functions,
alloc_huge_page_node() and alloc_huge_page_nodemask(), except
__GFP_THISNODE handling. It's redundant to have two almost similar
functions in order to handle this flag. So, this patch tries to remove
one by introducing a new argument, gfp_mask, to
alloc_huge_page_nodemask().

After introducing gfp_mask argument, it's caller's job to provide correct
gfp_mask. So, every callsites for alloc_huge_page_nodemask() are changed
to provide gfp_mask.

Note that it's safe to remove a node id check in alloc_huge_page_node()
since there is no caller passing NUMA_NO_NODE as a node id.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b4b38223 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: move migration helper from .h to .c

It's not performance sensitive function. Move it to .c. This is a
preparation step for future change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eaf444de 11-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

mm/migrate.c: delete duplicated word

Drop the repeated word "and".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a5bae25 11-Aug-2020 Anshuman Khandual <anshuman.khandual@arm.com>

mm/vmstat: add events for THP migration without split

Add following new vmstat events which will help in validating THP
migration without split. Statistics reported through these new VM events
will help in performance debugging.

1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAILURE
3. THP_MIGRATION_SPLIT

In addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
this updates current trace event 'mm_migrate_pages' to accommodate now
available THP statistics.

[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
[ziy@nvidia.com: v2]
Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
[anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0744f280 11-Aug-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: optimize migrate_vma_setup() for holes

Patch series "mm/migrate: optimize migrate_vma_setup() for holes".

A simple optimization for migrate_vma_*() when the source vma is not an
anonymous vma and a new test case to exercise it.

This patch (of 2):

When migrating system memory to device private memory, if the source
address range is a valid VMA range and there is no memory or a zero page,
the source PFN array is marked as valid but with no PFN.

This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
page if the VMA is an anonymous range.

There is no point in telling the device driver to allocate device private
memory and then not migrate the page. Instead, mark the source PFN array
entries as not migrating to avoid this overhead.

[rcampbell@nvidia.com: v2]
Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Bharata B Rao" <bharata@linux.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b518154e 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/vmscan: protect the workingset on anonymous LRU

In current implementation, newly created or swap-in anonymous page is
started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue. Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list. They are
promoted to active list if enough reference happens. This simple
modification changes the above example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.

Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1a06df6 07-Aug-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: fix migrate_pgmap_owner w/o CONFIG_MMU_NOTIFIER

On x86_64, when CONFIG_MMU_NOTIFIER is not set/enabled, there is a
compiler error:

mm/migrate.c: In function 'migrate_vma_collect':
mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member named 'migrate_pgmap_owner'
range.migrate_pgmap_owner = migrate->pgmap_owner;
^

Fixes: 998427b3ad2c ("mm/notifier: add migration invalidation type")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jason Gunthorpe" <jgg@mellanox.com>
Link: http://lkml.kernel.org/r/20200806193353.7124-1-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 998427b3 23-Jul-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/notifier: add migration invalidation type

Currently migrate_vma_setup() calls mmu_notifier_invalidate_range_start()
which flushes all device private page mappings whether or not a page is
being migrated to/from device private memory.

In order to not disrupt device mappings that are not being migrated, shift
the responsibility for clearing device private mappings to the device
driver and leave CPU page table unmapping handled by
migrate_vma_setup().

To support this, the caller of migrate_vma_setup() should always set
struct migrate_vma::pgmap_owner to a non NULL value that matches the
device private page->pgmap->owner. This value is then passed to the struct
mmu_notifier_range with a new event type which the driver's invalidation
function can use to avoid device MMU invalidations.

Link: https://lore.kernel.org/r/20200723223004.9586-4-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 5143192c 23-Jul-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: add a flags parameter to migrate_vma

The src_owner field in struct migrate_vma is being used for two purposes,
it acts as a selection filter for which types of pages are to be migrated
and it identifies device private pages owned by the caller.

Split this into separate parameters so the src_owner field can be used
just to identify device private pages owned by the caller of
migrate_vma_setup().

Rename the src_owner field to pgmap_owner to reflect it is now used only
to identify which device private pages to migrate.

Link: https://lore.kernel.org/r/20200723223004.9586-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 6ec4476a 08-Jul-2020 Linus Torvalds <torvalds@linux-foundation.org>

Raise gcc version requirement to 4.9

I realize that we fairly recently raised it to 4.8, but the fact is, 4.9
is a much better minimum version to target.

We have a number of workarounds for actual bugs in pre-4.9 gcc versions
(including things like internal compiler errors on ARM), but we also
have some syntactic workarounds for lacking features.

In particular, raising the minimum to 4.9 means that we can now just
assume _Generic() exists, which is likely the much better replacement
for a lot of very convoluted built-time magic with conditionals on
sizeof and/or __builtin_choose_expr() with same_type() etc.

Using _Generic also means that you will need to have a very recent
version of 'sparse', but thats easy to build yourself, and much less of
a hassle than some old gcc version can be.

The latest (in a long string) of reasons for minimum compiler version
upgrades was commit 5435f73d5c4a ("efi/x86: Fix build with gcc 4").

Ard points out that RHEL 7 uses gcc-4.8, but the people who stay back on
old RHEL versions persumably also don't build their own kernels anyway.
And maybe they should cross-built or just have a little side affair with
a newer compiler?

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1e8d7c6 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem comments

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e4e28c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem API comments

Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8ed45c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9eb1ea2 03-Jun-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: delete unused lrucare handling

Swapin faults were the last event to charge pages after they had already
been put on the LRU list. Now that we charge directly on swapin, the
lrucare portion of the charge code is unused.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d82c694 03-Jun-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API

With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type. All we need is a freshly allocated page and a
memcg context to charge.

v2: prevent double charges on pre-allocated hugepages in khugepaged

[hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be5d0a74 03-Jun-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: switch to native NR_ANON_MAPPED counter

Memcg maintains a private MEMCG_RSS counter. This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time. However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d1c2072 03-Jun-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters

Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
divergence from the generic VM accounting means unnecessary code overhead,
and creates a dependency for memcg that page->mapping is set up at the
time of charging, so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
The page is already locked in these places, so page->mem_cgroup is stable;
we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
it's set up in time.

Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
NR_SHMEM accounting sites.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3fba69a5 03-Jun-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: drop @compound parameter from memcg charging API

The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging. The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists. This makes for cryptic code and is a
breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along. This is safe because charging completes
before the page is published and somebody may split it.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally. That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.

The following patches will introduce a new charging API, best not to carry
over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1966391f 02-Jun-2020 Hugh Dickins <hughd@google.com>

mm/migrate.c: attach_page_private already does the get_page

Just finished bisecting mmotm, to find why a test which used to take
four minutes now took more than an hour: the __buffer_migrate_page()
cleanup left behind a get_page() which attach_page_private() now does.

Fixes: cd0f37154443 ("mm/migrate.c: call detach_page_private to cleanup code")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd0f3715 01-Jun-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

mm/migrate.c: call detach_page_private to cleanup code

We can cleanup code a little by call detach_page_private here.

[akpm@linux-foundation.org: use attach_page_private(), per Dave]
http://lkml.kernel.org/r/20200521225220.GV2005@dread.disaster.area
[akpm@linux-foundation.org: clear PagePrivate]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200519214049.15179-1-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d4388340 01-Jun-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: convert mpage_readpages to mpage_readahead

Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).

The callers are all trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f45ec5ff 06-Apr-2020 Peter Xu <peterx@redhat.com>

userfaultfd: wp: support swap and page migration

For either swap and page migration, we all use the bit 2 of the entry to
identify whether this entry is uffd write-protected. It plays a similar
role as the existing soft dirty bit in swap entries but only for keeping
the uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care of
the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.

In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
That can lead to data mismatch if the page that we are going to write
protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
also applies/removes the uffd-wp bit even for the swap entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9de4f22a 06-Apr-2020 Huang Ying <ying.huang@intel.com>

mm: code cleanup for MADV_FREE

Some comments for MADV_FREE is revised and added to help people understand
the MADV_FREE code, especially the page flag, PG_swapbacked. This makes
page_is_file_cache() isn't consistent with its comments. So the function
is renamed to page_is_file_lru() to make them consistent again. All these
are put in one patch as one logical change.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6aeff241 06-Apr-2020 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate.c: migrate PG_readahead flag

Currently the migration code doesn't migrate PG_readahead flag.
Theoretically this would incur slight performance loss as the application
might have to ramp its readahead back up again. Even though such problem
happens, it might be hidden by something else since migration is typically
triggered by compaction and NUMA balancing, any of which should be more
noticeable.

Migrate the flag after end_page_writeback() since it may clear PG_reclaim
flag, which is the same bit as PG_readahead, for the new page.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1581640185-95731-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d08221a0 06-Apr-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/migrate.c: unify "not queued for migration" handling in do_pages_move()

It can currently happen that we store the status of a page twice:
* Once we detect that it is already on the target node
* Once we moved a bunch of pages, and a page that's already on the
target node is contained in the current interval.

Let's simplify the code and always call do_move_pages_to_node() in case we
did not queue a page for migration. Note that pages that are already on
the target node are not added to the pagelist and are, therefore, ignored
by do_move_pages_to_node() - there is no functional change.

The status of such a page is now only stored once.

[david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-5-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d7ae891 06-Apr-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/migrate.c: check pagelist in move_pages_and_store_status()

When pagelist is empty, it is not necessary to do the move and store.
Also it consolidate the empty list check in one place.

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-4-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ca8783a 06-Apr-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/migrate.c: wrap do_move_pages_to_node() and store_status()

Usually, do_move_pages_to_node() and store_status() are used in
combination. We have three similar call sites.

Let's provide a wrapper for both function calls -
move_pages_and_store_status - to make the calling code easier to maintain
and fix (as noted by Yang Shi, the return value handling of
do_move_pages_to_node() has a flaw).

[david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-3-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4afdacec 06-Apr-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/migrate.c: no need to check for i > start in do_pages_move()

Patch series "cleanup on do_pages_move()", v5.

The logic in do_pages_move() is a little mess for audience to read and has
some potential error on handling the return value. Especially there are
three calls on do_move_pages_to_node() and store_status() with almost the
same form.

This patch set tries to make the code a little friendly for audience by
consolidate the calls.

This patch (of 4):

At this point, we always have i >= start. If i == start, store_status()
will return 0. So we can drop the check for i > start.

[david@redhat.com rephrase changelog]

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-2-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0d0381a 01-Apr-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.

mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 800bb1c8 16-Mar-2020 Christoph Hellwig <hch@lst.de>

mm: handle multiple owners of device private pages in migrate_vma

Add a new src_owner field to struct migrate_vma. If the field is set,
only device private pages with page->pgmap->owner equal to that field are
migrated. If the field is not set only "normal" pages are migrated.

Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
Link: https://lore.kernel.org/r/20200316193216.920734-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# b7a16c7a 03-Feb-2020 Steven Price <steven.price@arm.com>

mm: pagewalk: add 'depth' parameter to pte_hole

The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know what at what depth the
missing entry is. Add this is an extra parameter to pte_hole(). When the
depth isn't know (e.g. processing a vma) then -1 is passed.

The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.

Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34290e2c 30-Jan-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: add stable check in migrate_vma_insert_page()

migrate_vma_insert_page() closely follows the code in:
__handle_mm_fault()
handle_pte_fault()
do_anonymous_page()

Add a call to check_stable_address_space() after locking the page table
entry before inserting a ZONE_DEVICE private zero page mapping similar
to page faulting a new anonymous page.

Link: http://lkml.kernel.org/r/20200107211208.24595-4-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c23a0c99 30-Jan-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: clean up some minor coding style

Fix some comment typos and coding style clean up in preparation for the
next patch. No functional changes.

Link: http://lkml.kernel.org/r/20200107211208.24595-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 872ea707 30-Jan-2020 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: remove useless mask of start address

Addresses passed to walk_page_range() callback functions are already
page aligned and don't need to be masked with PAGE_MASK.

Link: http://lkml.kernel.org/r/20200107211208.24595-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5984fabb 30-Jan-2020 Yang Shi <yang.shi@linux.alibaba.com>

mm: move_pages: report the number of non-attempted pages

Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
semantic of move_pages() has changed to return the number of
non-migrated pages if they were result of a non-fatal reasons (usually a
busy page).

This was an unintentional change that hasn't been noticed except for LTP
tests which checked for the documented behavior.

There are two ways to go around this change. We can even get back to
the original behavior and return -EAGAIN whenever migrate_pages is not
able to migrate pages due to non-fatal reasons. Another option would be
to simply continue with the changed semantic and extend move_pages
documentation to clarify that -errno is returned on an invalid input or
when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the
number of pages that couldn't have been migrated due to ephemeral
reasons (e.g. page is pinned or locked for other reasons).

This patch implements the second option because this behavior is in
place for some time without anybody complaining and possibly new users
depending on it. Also it allows to have a slightly easier error
handling as the caller knows that it is worth to retry when err > 0.

But since the new semantic would be aborted immediately if migration is
failed due to ephemeral reasons, need include the number of
non-attempted pages in the return value too.

Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dfe9aa23 30-Jan-2020 Wei Yang <richardw.yang@linux.intel.com>

mm/migrate.c: also overwrite error when it is bigger than zero

If we get here after successfully adding page to list, err would be 1 to
indicate the page is queued in the list.

Current code has two problems:

* on success, 0 is not returned
* on error, if add_page_for_migratioin() return 1, and the following err1
from do_move_pages_to_node() is set, the err1 is not returned since err
is 1

And these behaviors break the user interface.

Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node").
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0153fc2 04-Jan-2020 Yang Shi <yang.shi@linux.alibaba.com>

mm: move_pages: return valid node id in status if the page is already on the target node

Felix Abecassis reports move_pages() would return random status if the
pages are already on the target node by the below test program:

int main(void)
{
const long node_id = 1;
const long page_size = sysconf(_SC_PAGESIZE);
const int64_t num_pages = 8;

unsigned long nodemask = 1 << node_id;
long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
if (ret < 0)
return (EXIT_FAILURE);

void **pages = malloc(sizeof(void*) * num_pages);
for (int i = 0; i < num_pages; ++i) {
pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
-1, 0);
if (pages[i] == MAP_FAILED)
return (EXIT_FAILURE);
}

ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
if (ret < 0)
return (EXIT_FAILURE);

int *nodes = malloc(sizeof(int) * num_pages);
int *status = malloc(sizeof(int) * num_pages);
for (int i = 0; i < num_pages; ++i) {
nodes[i] = node_id;
status[i] = 0xd0; /* simulate garbage values */
}

ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
printf("move_pages: %ld\n", ret);
for (int i = 0; i < num_pages; ++i)
printf("status[%d] = %d\n", i, status[i]);
}

Then running the program would return nonsense status values:

$ ./move_pages_bug
move_pages: 0
status[0] = 208
status[1] = 208
status[2] = 208
status[3] = 208
status[4] = 208
status[5] = 208
status[6] = 208
status[7] = 208

This is because the status is not set if the page is already on the
target node, but move_pages() should return valid status as long as it
succeeds. The valid status may be errno or node id.

We can't simply initialize status array to zero since the pages may be
not on node 0. Fix it by updating status with node id which the page is
already on.

Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Tested-by: Felix Abecassis <fabecassis@nvidia.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfe9d006 30-Nov-2019 Huang Ying <ying.huang@intel.com>

autonuma: fix watermark checking in migrate_balanced_pgdat()

When zone_watermark_ok() is called in migrate_balanced_pgdat() to check
migration target node, the parameter classzone_idx (for requested zone)
is specified as 0 (ZONE_DMA). But when allocating memory for autonuma
in alloc_misplaced_dst_page(), the requested zone from GFP flags is
ZONE_MOVABLE. That is, the requested zone is different. The size of
lowmem_reserve for the different requested zone is different. And this
may cause some issues.

For example, in the zoneinfo of a test machine as below,

Node 0, zone DMA32
pages free 61592
min 29
low 454
high 879
spanned 1044480
present 442306
managed 425921
protection: (0, 0, 62457, 62457, 62457)

The free page number of ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always
return true. So, autonuma may not stop even when memory pressure in
node 0 is heavy.

To fix the issue, ZONE_MOVABLE is used as parameter to call
zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as
requested zone in alloc_misplaced_dst_page(). So that
migrate_balanced_pgdat() returns false when memory pressure is heavy.

Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74d4a579 30-Nov-2019 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate.c: handle freed page at the first place

When doing migration if the freed page is met, we just return without
migrating it since it is pointless to migrate a freed page. But, the
current code allocates target page unconditionally before handling freed
page, if the page is freed, the newly allocated will be just freed. It
doesn't make too much sense and is just a waste of time although
migrating freed page is rare.

So, handle freed page at the before that to avoid unnecessary page
allocation and free.

Link: http://lkml.kernel.org/r/1573755869-106954-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 057d3389 25-Sep-2019 Andrey Konovalov <andreyknvl@google.com>

mm: untag user pointers passed to memory syscalls

This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.

The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.

Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 276f756d 23-Sep-2019 Pingfan Liu <kernelfans@gmail.com>

mm/migrate.c: clean up useless code in migrate_vma_collect_pmd()

Remove unused 'pfn' variable.

Link: http://lkml.kernel.org/r/1565167272-21453-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4101196b 23-Sep-2019 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: page cache: store only head pages in i_pages

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kirill Shutemov <kirill@shutemov.name>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Tested-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Tested-by: Qian Cai <cai@lca.pw>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8c6546b 23-Sep-2019 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: introduce compound_nr()

Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b86ac33 28-Aug-2019 Christoph Hellwig <hch@lst.de>

pagewalk: separate function pointers from iterator data

The mm_walk structure currently mixed data and code. Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# a520110e 28-Aug-2019 Christoph Hellwig <hch@lst.de>

mm: split out a new pagewalk.h header from mm.h

Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 9b2ed9cb 14-Aug-2019 Christoph Hellwig <hch@lst.de>

mm: remove CONFIG_MIGRATE_VMA_HELPER

CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
devic private memory support. Remove the option and just check for
CONFIG_DEVICE_PRIVATE instead.

Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 06d462be 14-Aug-2019 Christoph Hellwig <hch@lst.de>

mm: remove the unused MIGRATE_PFN_DEVICE flag

No one ever checks this flag, and we could easily get that information
from the page if needed.

Link: https://lore.kernel.org/r/20190814075928.23766-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# a7d1f22b 14-Aug-2019 Christoph Hellwig <hch@lst.de>

mm: turn migrate_vma upside down

There isn't any good reason to pass callbacks to migrate_vma. Instead
we can just export the three steps done by this function to drivers and
let them sequence the operation without callbacks. This removes a lot
of boilerplate code as-is, and will allow the drivers to drastically
improve code flow and error handling further on.

Link: https://lore.kernel.org/r/20190814075928.23766-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 7b358c6f 02-Aug-2019 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate.c: initialize pud_entry in migrate_vma()

When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
migrate_vma_collect() which initializes a struct mm_walk but didn't
initialize mm_walk.pud_entry. (Found by code inspection) Use a C
structure initialization to make sure it is set to NULL.

Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
Fixes: 8763cb45ab967 ("mm/migrate: new memory migration helper for use with device memory")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebdf4de5 02-Aug-2019 Jan Kara <jack@suse.cz>

mm: migrate: fix reference check race between __find_get_block() and migration

buffer_migrate_page_norefs() can race with bh users in the following
way:

CPU1 CPU2
buffer_migrate_page_norefs()
buffer_migrate_lock_buffers()
checks bh refs
spin_unlock(&mapping->private_lock)
__find_get_block()
spin_lock(&mapping->private_lock)
grab bh ref
spin_unlock(&mapping->private_lock)
move page do bh work

This can result in various issues like lost updates to buffers (i.e.
metadata corruption) or use after free issues for the old page.

This patch closes the race by holding mapping->private_lock while the
mapping is being moved to a new page. Ordinarily, a reference can be
taken outside of the private_lock using the per-cpu BH LRU but the
references are checked and the LRU invalidated if necessary. The
private_lock is held once the references are known so the buffer lookup
slow path will spin on the private_lock. Between the page lock and
private_lock, it should be impossible for other references to be
acquired and updates to happen during the migration.

A user had reported data corruption issues on a distribution kernel with
a similar page migration implementation as mainline. The data
corruption could not be reproduced with this patch applied. A small
number of migration-intensive tests were run and no performance problems
were noted.

[mgorman@techsingularity.net: Changelog, removed tracing]
Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37109694 18-Jul-2019 Keith Busch <kbusch@kernel.org>

mm: migrate: remove unused mode argument

migrate_page_move_mapping() doesn't use the mode argument. Remove it
and update callers accordingly.

Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69bf4b6b 05-Jul-2019 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: page cache: store only head pages in i_pages"

This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
anon
flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
page dumped because: VM_BUG_ON_PAGE(entry != page)
page->mem_cgroup:ffff978433ace000
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:170!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
RIP: 0010:__delete_from_swap_cache+0x20d/0x240
Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
Call Trace:
delete_from_swap_cache+0x46/0xa0
try_to_free_swap+0xbc/0x110
swap_writepage+0x13/0x70
pageout.isra.0+0x13c/0x350
shrink_page_list+0xc14/0xdf0
shrink_inactive_list+0x1e5/0x3c0
shrink_node_memcg+0x202/0x760
shrink_node+0xe0/0x470
balance_pgdat+0x2d1/0x510
kswapd+0x220/0x420
kthread+0xfb/0x130
ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens. It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25b2995a 13-Jun-2019 Christoph Hellwig <hch@lst.de>

mm: remove MEMORY_DEVICE_PUBLIC support

The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 7269f999 13-May-2019 Jérôme Glisse <jglisse@redhat.com>

mm/mmu_notifier: use correct mmu_notifier events for each invalidation

This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table. See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6f4f13e8 13-May-2019 Jérôme Glisse <jglisse@redhat.com>

mm/mmu_notifier: contextual information for event triggering invalidation

CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them. While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens. A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5fd4ca2d 13-May-2019 Matthew Wilcox <willy@infradead.org>

mm: page cache: store only head pages in i_pages

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

[willy@infradead.org: fix swapcache pages]
Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
[kirill@shutemov.name: hugetlb stores pages in page cache differently]
Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kirill Shutemov <kirill@shutemov.name>
Reviewed-and-tested-by: Song Liu <songliubraving@fb.com>
Tested-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d2b2c6dd 28-Mar-2019 Lars Persson <lars.persson@axis.com>

mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate

Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
and SIGSEGV that could not be traced back to a userspace code bug. They
had all the magic signs of an I/D cache coherency issue.

Now recently we noticed that the /proc/sys/vm/compact_memory interface
was quite efficient at provoking this class of userspace crashes.

Studying the code in mm/migrate.c there is a distinction made between
migrating a page that is mapped at the instant of migration and one that
is not mapped. Our problem turned out to be the non-mapped pages.

For the non-mapped page the code performs a copy of the page content and
all relevant meta-data of the page without doing the required D-cache
maintenance. This leaves dirty data in the D-cache of the CPU and on
the 1004K cores this data is not visible to the I-cache. A subsequent
page-fault that triggers a mapping of the page will happily serve the
process with potentially stale code.

What about ARM then, this bug should have seen greater exposure? Well
ARM became immune to this flaw back in 2010, see commit c01778001a4f
("ARM: 6379/1: Assume new page cache pages have dirty D-cache").

My proposed fix moves the D-cache maintenance inside move_to_new_page to
make it common for both cases.

Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
Fixes: 97ee0524614 ("flush cache before installing new page at migraton")
Signed-off-by: Lars Persson <larper@axis.com>
Reviewed-by: Paul Burton <paul.burton@mips.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f900482d 05-Mar-2019 Jan Kara <jack@suse.cz>

mm/migrate.c: cleanup expected_page_refs()

Andrea has noted that page migration code propagates page_mapping(page)
through the whole migration stack down to migrate_page() function so it
seems stupid to then use page_mapping(page) in expected_page_refs()
instead of passed down 'mapping' argument. I agree so let's make
expected_page_refs() more in line with the rest of the migration stack.

Link: http://lkml.kernel.org/r/20190207112314.24872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bb4e7a2 05-Mar-2019 Wei Yang <richard.weiyang@gmail.com>

mm: fix some typos in mm directory

No functional change.

Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 806031bb 05-Mar-2019 Mel Gorman <mgorman@techsingularity.net>

mm, migrate: immediately fail migration of a page with no migration handler

Pages with no migration handler use a fallback handler which sometimes
works and sometimes persistently retries. A historical example was
blockdev pages but there are others such as odd refcounting when
page->private is used. These are retried multiple times which is
wasteful during compaction so this patch will fail migration faster
unless the caller specifies MIGRATE_SYNC.

This is not expected to help THP allocation success rates but it did
reduce latencies very slightly in some cases.

1-socket thpfioscale
4.20.0 4.20.0
noreserved-v2r15 failfast-v2r15
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 3839.67 ( 0.00%) 3833.72 ( 0.15%)
Amean fault-both-5 5177.47 ( 0.00%) 4967.15 ( 4.06%)
Amean fault-both-7 7245.03 ( 0.00%) 7139.19 ( 1.46%)
Amean fault-both-12 11534.89 ( 0.00%) 11326.30 ( 1.81%)
Amean fault-both-18 16241.10 ( 0.00%) 16270.70 ( -0.18%)
Amean fault-both-24 19075.91 ( 0.00%) 19839.65 ( -4.00%)
Amean fault-both-30 22712.11 ( 0.00%) 21707.05 ( 4.43%)
Amean fault-both-32 21692.92 ( 0.00%) 21968.16 ( -1.27%)

The 2-socket results are not materially different. Scan rates are
similar as expected.

Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ed2c31d 05-Mar-2019 Anshuman Khandual <anshuman.khandual@arm.com>

mm/hugetlb: distinguish between migratability and movability

Patch series "arm64/mm: Enable HugeTLB migration", v4.

This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series. All permutations have been tested except for the 16GB.

CONT PTE PMD CONT PMD PUD
-------- --- -------- ---
4K: 64K 2M 32M 1G
16K: 2M 32M 1G
64K: 2M 512M 16G

First the series adds migration support for PUD based huge pages. It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.

The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.

This patch (of 5):

During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability. Movability in itself is a distinct property
which should not be tied with migratability alone.

This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone. At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.

Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb6acd01 28-Feb-2019 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: fix races and page leaks during migration

hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.

When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.

To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.

Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:

node0
1024 free_hugepages
1024 nr_hugepages

node1
0 free_hugepages
1024 nr_hugepages

Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:

node0
1024 free_hugepages
1024 nr_hugepages

node1
1024 free_hugepages
1024 nr_hugepages

Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem

If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.

There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.

To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.

Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0a352fa 01-Feb-2019 David Hildenbrand <david@redhat.com>

mm: migrate: don't rely on __PageMovable() of newpage after unlocking it

We had a race in the old balloon compaction code before b1123ea6d3b3
("mm: balloon: use general non-lru movable page feature") refactored it
that became visible after backporting 195a8c43e93d ("virtio-balloon:
deflate via a page list") without the refactoring.

The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
use general non-lru movable page feature"). d6d86c0a7f8d
("mm/balloon_compaction: redesign ballooned pages management") was
backported to 3.12, so the broken kernels are stable kernels [3.12 -
4.7].

There was a subtle race between dropping the page lock of the newpage in
__unmap_and_move() and checking for __is_movable_balloon_page(newpage).

Just after dropping this page lock, virtio-balloon could go ahead and
deflate the newpage, effectively dequeueing it and clearing PageBalloon,
in turn making __is_movable_balloon_page(newpage) fail.

This resulted in dropping the reference of the newpage via
putback_lru_page(newpage) instead of put_page(newpage), leading to
page->lru getting modified and a !LRU page ending up in the LRU lists.
With 195a8c43e93d ("virtio-balloon: deflate via a page list")
backported, one would suddenly get corrupted lists in
release_pages_balloon():

- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
- list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

Nowadays this race is no longer possible, but it is hidden behind very
ugly handling of __ClearPageMovable() and __PageMovable().

__ClearPageMovable() will not make __PageMovable() fail, only
PageMovable(). So the new check (__PageMovable(newpage)) will still
hold even after newpage was dequeued by virtio-balloon.

If anybody would ever change that special handling, the BUG would be
introduced again. So instead, make it explicit and use the information
of the original isolated page before migration.

This patch can be backported fairly easy to stable kernels (in contrast
to the refactoring).

Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Vratislav Bendel <vbendel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vratislav Bendel <vbendel@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [3.12 - 4.7]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80409c65 01-Feb-2019 Jan Kara <jack@suse.cz>

mm: migrate: make buffer_migrate_page_norefs() actually succeed

Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer. In
fact, there's no reason for buffer_migrate_lock_buffers() to grab any
buffer references as the page is locked during all our operation and
thus nobody can reclaim buffers from the page.

So remove grabbing of buffer references which also makes
buffer_migrate_page_norefs() succeed.

Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddeaab32 08-Jan-2019 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"

This reverts b43a9990055958e70347c56f90ea2ae32c67334c

The reverted commit caused issues with migration and poisoning of anon
huge pages. The LTP move_pages12 test will cause an "unable to handle
kernel NULL pointer" BUG would occur with stack similar to:

RIP: 0010:down_write+0x1b/0x40
Call Trace:
migrate_pages+0x81f/0xb90
__ia32_compat_sys_migrate_pages+0x190/0x190
do_move_pages_to_node.isra.53.part.54+0x2a/0x50
kernel_move_pages+0x566/0x7b0
__x64_sys_move_pages+0x24/0x30
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The purpose of the reverted patch was to fix some long existing races
with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
idea that this could also be used to address truncate/page fault races
with another patch. Further analysis has determined that i_mmap_rwsem
can not be used to address all these hugetlbfs synchronization issues.
Therefore, revert this patch while working an another approach to the
underlying issues.

Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4cf58924 03-Jan-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

mm: treewide: remove unused address argument from pte_alloc functions

Patch series "Add support for fast mremap".

This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work. Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well. Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

fn(...
- , T2 E2
)
{ ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

fn(...
-, E2
)

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b43a9990 28-Dec-2018 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.

- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.

[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab41ee68 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: drop unused argument of migrate_page_move_mapping()

All callers of migrate_page_move_mapping() now pass NULL for 'head'
argument. Drop it.

Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89cb0888 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: provide buffer_migrate_page_norefs()

Provide a variant of buffer_migrate_page() that also checks whether there
are no unexpected references to buffer heads. This function will then be
safe to use for block device pages.

[akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84ade7c1 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: move migrate_page_lock_buffers()

buffer_migrate_page() is the only caller of migrate_page_lock_buffers()
move it close to it and also drop the now unused stub for !CONFIG_BLOCK.

Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc4f11e6 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: lock buffers before migrate_page_move_mapping()

Lock buffers before calling into migrate_page_move_mapping() so that that
function doesn't have to know about buffers (which is somewhat unexpected
anyway) and all the buffer head logic is in buffer_migrate_page().

Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b3901b3 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migration: factor out code to compute expected number of page references

Patch series "mm: migrate: Fix page migration stalls for blkdev pages".

This patchset deals with page migration stalls that were reported by our
customer due to a block device page that had a bufferhead that was in the
bh LRU cache.

The patchset modifies the page migration code so that bufferheads are
completely handled inside buffer_migrate_page() and then provides a new
migration helper for pages with buffer heads that is safe to use even for
block device pages and that also deals with bh lrus.

This patch (of 6):

Factor out function to compute number of expected page references in
migrate_page_move_mapping(). Note that we move hpage_nr_pages() and
page_has_private() checks from under xas_lock_irq() however this is safe
since we hold page lock.

[jack@suse.cz: fix expected_page_refs()]
Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac46d4f3 28-Dec-2018 Jérôme Glisse <jglisse@redhat.com>

mm/mmu_notifier: use structure for invalidate_range_start/end calls v2

To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks. No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a1ea439 28-Dec-2018 Hugh Dickins <hughd@google.com>

mm: put_and_wait_on_page_locked() while page is migrated

Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?

shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.

It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.

[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7eef5f97 26-Oct-2018 Andrea Arcangeli <aarcange@redhat.com>

mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()

There should be no cache left by the time we overwrite the old transhuge
pmd with the new one. It's already too late to flush through the virtual
address because we already copied the page data to the new physical
address.

So flush the cache before the data copy.

Also delete the "end" variable to shutoff a "unused variable" warning on
x86 where flush_cache_range() is a noop.

Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7066f0f9 26-Oct-2018 Andrea Arcangeli <aarcange@redhat.com>

mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()

change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
right away. do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page()
runs some CPU could still access the page through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start(). So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs. It has to or if a secondary MMU can
still write to the page, the migrate_page_copy() would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:

CPU0 CPU1 GPU sharing linux pagetables using
only ->invalidate_range
----------- ------------ ---------
GPU secondary MMU writes to the page
mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
do_huge_pmd_numa_page()
CPU TLB flush globally (1)
CPU cannot write to page
migrate_misplaced_transhuge_page()
GPU writes to the page...
migrate_page_copy()
...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_stop() noop
->invalidate_range()
GPU secondary MMU is invalidated
and cannot write to the page anymore
(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it run after migrate_page_copy().

After the hugepage data copy is done migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at without having to flush the TLB nor any
secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before the migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.

Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7c33934 26-Oct-2018 Andrea Arcangeli <aarcange@redhat.com>

mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition

Patch series "migrate_misplaced_transhuge_page race conditions".

Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.

While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.

These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non zero after MADV_DONTNEED.

I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy,
requires the latest iommu hardware or infiniband to reproduce). The last
patch is noop for x86 and it needs further review from maintainers of
archs that implement flush_cache_range() (not in CC yet).

To avoid confusion, it's not the first patch that introduces the bug fixed
in the second patch, even before removing the
pmdp_huge_clear_flush_notify, that _notify suffix was called after
migrate_page_copy already run.

This patch (of 3):

This is a corollary of ced108037c2aa ("thp: fix MADV_DONTNEED vs. numa
balancing race"), 58ceeb6bec8 ("thp: fix MADV_DONTNEED vs. MADV_FREE
race") and 5b7abeae3af8c ("thp: fix MADV_DONTNEED vs clear soft dirty
race).

When the above three fixes where posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.

The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407fbf7 ("mm: Close races between THP migration and PMD numa
clearing").

The important part of such commit is only the part where the page lock is
not released until the first do_huge_pmd_numa_page() finished disarming
the pagenuma/protnone.

The addition of pmdp_clear_flush() wasn't beneficial to such commit and
there's no commentary about such an addition either.

I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.

At that point in time nobody thought of such kind of MADV_DONTNEED race
conditions yet (they were fixed later) so the code may have looked more
robust by adding the pmdp_clear_flush().

This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.

This also optimizes the code and removes a superfluous TLB flush.

[akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1899ad18 26-Oct-2018 Johannes Weiner <hannes@cmpxchg.org>

mm: workingset: tell cache transitions from workingset thrashing

Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

20 bits are always page flags

21 if you have an MMU

23 with the zone bits for DMA, Normal, HighMem, Movable

29 with the sparsemem section bits

30 if PAE is enabled

31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89eb946a 04-Dec-2017 Matthew Wilcox <willy@infradead.org>

mm: Convert page migration to XArray

Signed-off-by: Matthew Wilcox <willy@infradead.org>


# e6112fc3 05-Oct-2018 Anshuman Khandual <anshuman.khandual@arm.com>

mm/migrate.c: split only transparent huge pages when allocation fails

split_huge_page_to_list() fails on HugeTLB pages. I was experimenting
with moving 32MB contig HugeTLB pages on arm64 (with a debug patch
applied) and hit the following stack trace when the kernel crashed.

[ 3732.462797] Call trace:
[ 3732.462835] split_huge_page_to_list+0x3b0/0x858
[ 3732.462913] migrate_pages+0x728/0xc20
[ 3732.462999] soft_offline_page+0x448/0x8b0
[ 3732.463097] __arm64_sys_madvise+0x724/0x850
[ 3732.463197] el0_svc_handler+0x74/0x110
[ 3732.463297] el0_svc+0x8/0xc
[ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420)

When unmap_and_move[_huge_page]() fails due to lack of memory, the
splitting should happen only for transparent huge pages not for HugeTLB
pages. PageTransHuge() returns true for both THP and HugeTLB pages.
Hence the conditonal check should test PagesHuge() flag to make sure that
given pages is not a HugeTLB one.

Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com
Fixes: 94723aafb9 ("mm: unclutter THP migration")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e125fe40 05-Oct-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, thp: fix mlocking THP page with migration enabled

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.

Introduction of THP migration breaks[1] the rules around mlocking THP
pages. If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.

For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().

Anon THP pages can only be shared between processes via fork(). Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().

For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page. PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.

Test-case:

#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/mempolicy.h>
#include <numaif.h>

int main(void)
{
unsigned long nodemask = 4;
void *addr;

addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

if (fork()) {
wait(NULL);
return 0;
}

mlock(addr, 4UL << 10);
mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
&nodemask, 4, MPOL_MF_MOVE);

return 0;
}

[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# efaffc5e 01-Oct-2018 Mel Gorman <mgorman@techsingularity.net>

mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration

Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
4.19.0-rc5 4.19.0-rc5
numab-v1r1 noratelimit-v1r1
MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%)
MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%)
MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%)
MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 75346121 21-Sep-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

mm/migrate: Use spin_trylock() while resetting rate limit

Since this spinlock will only serialize the migrate rate limiting,
convert the spin_lock() to a spin_trylock(). If another thread is updating, this
task can move on.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 205332 198512 -3.32145
1 319785 313559 -1.94693

2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
8 74912 74761.9 -0.200368
1 206585 214874 4.01239

2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 189162 180536 -4.56011
1 213760 210281 -1.62753

4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 58736.8 56511.4 -3.78877
1 105419 104899 -0.49327

Avoiding stretching of window intervals may be the reason for the
regression. Also code now uses READ_ONCE/WRITE_ONCE. That may
also be hurting performance to some extent.

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 14,285,708 13,818,546
migrations 1,180,621 1,149,960
faults 339,114 385,583
cache-misses 55,205,631,894 55,259,546,768
sched:sched_move_numa 843 2,257
sched:sched_stick_numa 6 9
sched:sched_swap_numa 219 512
migrate:mm_migrate_pages 365 2,225

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 26907 72692
numa_hint_faults_local 24279 62270
numa_hit 239771 238762
numa_huge_pte_updates 0 48
numa_interleave 68 75
numa_local 239688 238676
numa_other 83 86
numa_pages_migrated 363 2225
numa_pte_updates 27415 98557

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,202,779 3,173,490
migrations 37,186 36,966
faults 106,076 108,776
cache-misses 12,024,873,744 12,200,075,320
sched:sched_move_numa 931 1,264
sched:sched_stick_numa 0 0
sched:sched_swap_numa 1 0
migrate:mm_migrate_pages 637 899

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 17409 21109
numa_hint_faults_local 14367 17120
numa_hit 73953 72934
numa_huge_pte_updates 20 42
numa_interleave 25 33
numa_local 73892 72866
numa_other 61 68
numa_pages_migrated 668 915
numa_pte_updates 27276 42326

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,474,013 8,312,022
migrations 254,934 231,705
faults 320,506 310,242
cache-misses 110,580,458 402,324,573
sched:sched_move_numa 725 193
sched:sched_stick_numa 0 0
sched:sched_swap_numa 7 3
migrate:mm_migrate_pages 145 93

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 22797 11838
numa_hint_faults_local 21539 11216
numa_hit 89308 90689
numa_huge_pte_updates 0 0
numa_interleave 865 1579
numa_local 88955 89634
numa_other 353 1055
numa_pages_migrated 149 92
numa_pte_updates 22930 12109

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,195,628 2,170,481
migrations 11,179 10,126
faults 149,656 160,962
cache-misses 8,117,515 10,834,845
sched:sched_move_numa 49 10
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 5 2

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 3577 403
numa_hint_faults_local 3476 358
numa_hit 26142 25898
numa_huge_pte_updates 0 0
numa_interleave 358 207
numa_local 26042 25860
numa_other 100 38
numa_pages_migrated 5 2
numa_pte_updates 3587 400

perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 100,602,296 110,339,633
migrations 4,135,630 4,139,812
faults 789,256 863,622
cache-misses 226,160,621,058 231,838,045,660
sched:sched_move_numa 1,366 2,196
sched:sched_stick_numa 16 33
sched:sched_swap_numa 374 544
migrate:mm_migrate_pages 1,350 2,469

vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 47857 85748
numa_hint_faults_local 39768 66831
numa_hit 240165 242213
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 240165 242211
numa_other 0 2
numa_pages_migrated 1224 2376
numa_pte_updates 48354 86233

perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 58,515,496 59,331,057
migrations 564,845 552,019
faults 245,807 266,586
cache-misses 73,603,757,976 73,796,312,990
sched:sched_move_numa 996 981
sched:sched_stick_numa 10 54
sched:sched_swap_numa 193 286
migrate:mm_migrate_pages 646 713

vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 13422 14807
numa_hint_faults_local 5619 5738
numa_hit 36118 36230
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36116 36228
numa_other 2 2
numa_pages_migrated 616 703
numa_pte_updates 13374 14742

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d4ae9916 23-Aug-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: close the race against page allocation

A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline. This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."

Two main changes of this patch are:

- setting migrate type of the target page to MIGRATE_ISOLATE. As done
in free_unref_page_commit(), this makes kernel bypass pcplist when
freeing the page. So we can assume that the page is in freelist just
after put_page() returns,

- setting PG_hwpoison on free page under zone->lock which protects
freelists, so this allows us to avoid setting PG_hwpoison on a page
that is decided to be allocated soon.

[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bc9b564 23-Aug-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: fix race on soft-offlining free huge pages

Patch series "mm: soft-offline: fix race against page allocation".

Xishi recently reported the issue about race on reusing the target pages
of soft offlining. Discussion and analysis showed that we need make
sure that setting PG_hwpoison should be done in the right place under
zone->lock for soft offline. 1/2 handles free hugepage's case, and 2/2
hanldes free buddy page's case.

This patch (of 2):

There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.

The process killing is caused by the following flow:

CPU 0 CPU 1 CPU 2

soft offline
get_any_page
// find the hugetlb is free
mmap a hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
// succeed
soft_offline_free_page
// set hwpoison flag
mmap the hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
find_lock_page
return VM_FAULT_HWPOISON
mm_fault_error
do_sigbus
// kill the process

The hugetlb allocation failure comes from the following flow:

CPU 0 CPU 1

mmap a hugetlb file
// reserve all free page but don't fault-in
soft offline
get_any_page
// find the hugetlb is free
soft_offline_free_page
// set hwpoison flag
dissolve_free_huge_page
// fail because all free hugepages are reserved
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
...
dequeue_huge_page_node_exact
// ignore hwpoisoned hugepage
// and finally fail due to no-mem

The root cause of this is that current soft-offline code is written based
on an assumption that PageHWPoison flag should be set at first to avoid
accessing the corrupted data. This makes sense for memory_failure() or
hard offline, but does not for soft offline because soft offline is about
corrected (not uncorrected) error and is safe from data lost. This patch
changes soft offline semantics where it sets PageHWPoison flag only after
containment of the error page completes successfully.

Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 815f0ddb 22-Aug-2018 Nick Desaulniers <ndesaulniers@google.com>

include/linux/compiler*.h: make compiler-*.h mutually exclusive

Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
recently exposed a brittle part of the build for supporting non-gcc
compilers.

Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
__GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
added compiler specific checks for __clang__ or __INTEL_COMPILER.

This is brittle, as they happened to get compatibility by posing as a
certain version of GCC. This broke when upgrading the minimal version
of GCC required to build the kernel, to a version above what ICC and
Clang claim to be.

Rather than always including compiler-gcc.h then undefining or
redefining macros in compiler-intel.h or compiler-clang.h, let's
separate out the compiler specific macro definitions into mutually
exclusive headers, do more proper compiler detection, and keep shared
definitions in compiler_types.h.

Fixes: cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Eli Friedman <efriedma@codeaurora.org>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1fb4a08 17-Aug-2018 Dave Jiang <dave.jiang@intel.com>

dax: remove VM_MIXEDMAP for fsdax and device dax

This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 013567be 11-May-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: fix double call of radix_tree_replace_slot()

radix_tree_replace_slot() is called twice for head page, it's obviously
a bug. Let's fix it.

Link: http://lkml.kernel.org/r/20180423072101.GA12157@hori1.linux.bs1.fc.nec.co.jp
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <zi.yan@sent.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e71769ae 20-Apr-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: enable thp migration for shmem thp

My testing for the latest kernel supporting thp migration showed an
infinite loop in offlining the memory block that is filled with shmem
thps. We can get out of the loop with a signal, but kernel should return
with failure in this case.

What happens in the loop is that scan_movable_pages() repeats returning
the same pfn without any progress. That's because page migration always
fails for shmem thps.

In memory offline code, memory blocks containing unmovable pages should be
prevented from being offline targets by has_unmovable_pages() inside
start_isolate_page_range(). So it's possible to change migratability for
non-anonymous thps to avoid the issue, but it introduces more complex and
thp-specific handling in migration code, so it might not good.

So this patch is suggesting to fix the issue by enabling thp migration for
shmem thp. Both of anon/shmem thp are migratable so we don't need
precheck about the type of thps.

Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <zi.yan@sent.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f175cf5 20-Apr-2018 Michal Hocko <mhocko@suse.com>

mm: fix do_pages_move status handling

Li Wang has reported that LTP move_pages04 test fails with the current
tree:

LTP move_pages04:
TFAIL : move_pages04.c:143: status[1] is EPERM, expected EFAULT

The test allocates an array of two pages, one is present while the other
is not (resp. backed by zero page) and it expects EFAULT for the second
page as the man page suggests. We are reporting EPERM which doesn't make
any sense and this is a result of a bug from cf5f16b23ec9 ("mm: unclutter
THP migration").

do_pages_move tries to handle as many pages in one batch as possible so we
queue all pages with the same node target together and that corresponds to
[start, i] range which is then used to update status array.
add_page_for_migration will correctly notice the zero (resp. !present)
page and returns with EFAULT which gets written to the status. But if
this is the last page in the array we do not update start and so the last
store_status after the loop will overwrite the range of the last batch
with NUMA_NO_NODE (which corresponds to EPERM).

Fix this by simply bailing out from the last flush if the pagelist is
empty as there is clearly nothing more to do.

Link: http://lkml.kernel.org/r/20180418121255.334-1-mhocko@kernel.org
Fixes: cf5f16b23ec9 ("mm: unclutter THP migration")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b93b0163 10-Apr-2018 Matthew Wilcox <willy@infradead.org>

page cache: use xa_lock

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94723aaf 10-Apr-2018 Michal Hocko <mhocko@suse.com>

mm: unclutter THP migration

THP migration is hacked into the generic migration with rather
surprising semantic. The migration allocation callback is supposed to
check whether the THP can be migrated at once and if that is not the
case then it allocates a simple page to migrate. unmap_and_move then
fixes that up by spliting the THP into small pages while moving the head
page to the newly allocated order-0 page. Remaning pages are moved to
the LRU list by split_huge_page. The same happens if the THP allocation
fails. This is really ugly and error prone [1].

I also believe that split_huge_page to the LRU lists is inherently wrong
because all tail pages are not migrated. Some callers will just work
around that by retrying (e.g. memory hotplug). There are other pfn
walkers which are simply broken though. e.g. madvise_inject_error will
migrate head and then advances next pfn by the huge page size.
do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
will simply split the THP before migration if the THP migration is not
supported then falls back to single page migration but it doesn't handle
tail pages if the THP migration path is not able to allocate a fresh THP
so we end up with ENOMEM and fail the whole migration which is a
questionable behavior. Page compaction doesn't try to migrate large
pages so it should be immune.

This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer where it actually belongs. We
simply split the THP page into the existing list if unmap_and_move fails
with ENOMEM and retry. So we will _always_ migrate all THP subpages and
specific migrate_pages users do not have to deal with this case in a
special way.

[1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 666feb21 10-Apr-2018 Michal Hocko <mhocko@suse.com>

mm, migrate: remove reason argument from new_page_t

No allocation callback is using this argument anymore. new_page_node
used to use this parameter to convey node_id resp. migration error up
to move_pages code (do_move_page_to_node_array). The error status never
made it into the final status field and we have a better way to
communicate node id to the status field now. All other allocation
callbacks simply ignored the argument so we can drop it finally.

[mhocko@suse.com: fix migration callback]
Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
[akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
[mhocko@kernel.org: fix build]
Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a49bd4d7 10-Apr-2018 Michal Hocko <mhocko@suse.com>

mm, numa: rework do_pages_move

Patch series "unclutter thp migration"

Motivation:

THP migration is hacked into the generic migration with rather
surprising semantic. The migration allocation callback is supposed to
check whether the THP can be migrated at once and if that is not the
case then it allocates a simple page to migrate. unmap_and_move then
fixes that up by splitting the THP into small pages while moving the
head page to the newly allocated order-0 page. Remaining pages are
moved to the LRU list by split_huge_page. The same happens if the THP
allocation fails. This is really ugly and error prone [2].

I also believe that split_huge_page to the LRU lists is inherently wrong
because all tail pages are not migrated. Some callers will just work
around that by retrying (e.g. memory hotplug). There are other pfn
walkers which are simply broken though. e.g. madvise_inject_error will
migrate head and then advances next pfn by the huge page size.
do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
will simply split the THP before migration if the THP migration is not
supported then falls back to single page migration but it doesn't handle
tail pages if the THP migration path is not able to allocate a fresh THP
so we end up with ENOMEM and fail the whole migration which is a
questionable behavior. Page compaction doesn't try to migrate large
pages so it should be immune.

The first patch reworks do_pages_move which relies on a very ugly
calling semantic when the return status is pushed to the migration path
via private pointer. It uses pre allocated fixed size batching to
achieve that. We simply cannot do the same if a THP is to be split
during the migration path which is done in the patch 3. Patch 2 is
follow up cleanup which removes the mentioned return status calling
convention ugliness.

On a side note:

There are some semantic issues I have encountered on the way when
working on patch 1 but I am not addressing them here. E.g. trying to
move THP tail pages will result in either success or EBUSY (the later
one more likely once we isolate head from the LRU list). Hugetlb
reports EACCESS on tail pages. Some errors are reported via status
parameter but migration failures are not even though the original
`reason' argument suggests there was an intention to do so. From a
quick look into git history this never worked. I have tried to keep the
semantic unchanged.

Then there is a relatively minor thing that the page isolation might
fail because of pages not being on the LRU - e.g. because they are
sitting on the per-cpu LRU caches. Easily fixable.

This patch (of 3):

do_pages_move is supposed to move user defined memory (an array of
addresses) to the user defined numa nodes (an array of nodes one for
each address). The user provided status array then contains resulting
numa node for each address or an error. The semantic of this function
is little bit confusing because only some errors are reported back.
Notably migrate_pages error is only reported via the return value. This
patch doesn't try to address these semantic nuances but rather change
the underlying implementation.

Currently we are processing user input (which can be really large) in
batches which are stored to a temporarily allocated page. Each address
is resolved to its struct page and stored to page_to_node structure
along with the requested target numa node. The array of these
structures is then conveyed down the page migration path via private
argument. new_page_node then finds the corresponding structure and
allocates the proper target page.

What is the problem with the current implementation and why to change
it? Apart from being quite ugly it also doesn't cope with unexpected
pages showing up on the migration list inside migrate_pages path. That
doesn't happen currently but the follow up patch would like to make the
thp migration code more clear and that would need to split a THP into
the list for some cases.

How does the new implementation work? Well, instead of batching into a
fixed size array we simply batch all pages that should be migrated to
the same node and isolate all of them into a linked list which doesn't
require any additional storage. This should work reasonably well
because page migration usually migrates larger ranges of memory to a
specific node. So the common case should work equally well as the
current implementation. Even if somebody constructs an input where the
target numa nodes would be interleaved we shouldn't see a large
performance impact because page migration alone doesn't really benefit
from batching. mmap_sem batching for the lookup is quite questionable
and isolate_lru_page which would benefit from batching is not using it
even in the current implementation.

Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 07707125 10-Apr-2018 Ralph Campbell <rcampbell@nvidia.com>

mm/migrate: properly preserve write attribute in special migrate entry

Use of pte_write(pte) is only valid for present pte, the common code
which set the migration entry can be reach for both valid present pte
and special swap entry (for device memory). Fix the code to use the
mpfn value which properly handle both cases.

On x86 this did not have any bad side effect because pte write bit is
below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which
in turn means we were always creating read only special migration entry.

So once migration did finish we always write protected the CPU page
table entry (moreover this is only an issue when migrating from device
memory to system memory). End effect is that CPU write access would
fault again and restore write permission.

This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to
take a second fault on write access. ie, double faulting the same
address. There is no corruption or incorrect states (it behaves as a
COWed page from a fork with a mapcount of 1).

Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09a913a7 10-Apr-2018 Mel Gorman <mgorman@techsingularity.net>

sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages

change_pte_range is called from task work context to mark PTEs for
receiving NUMA faulting hints. If the marked pages are dirty then
migration may fail. Some filesystems cannot migrate dirty pages without
blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
Even when they can, it can be a waste of cycles when the pages are
shared forcing higher scan rates. This patch avoids marking shared
dirty pages for hinting faults but also will skip a migration if the
page was dirtied after the scanner updated a clean page.

This is most noticeable running the NASA Parallel Benchmark when backed
by btrfs, the default root filesystem for some distributions, but also
noticeable when using XFS.

The following are results from a 4-socket machine running a 4.16-rc4
kernel with some scheduler patches that are pending for the next merge
window.

4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1
Time cg.D 459.07 ( 0.00%) 444.21 ( 3.24%)
Time ep.D 76.96 ( 0.00%) 77.69 ( -0.95%)
Time is.D 25.55 ( 0.00%) 27.85 ( -9.00%)
Time lu.D 601.58 ( 0.00%) 596.87 ( 0.78%)
Time mg.D 107.73 ( 0.00%) 108.22 ( -0.45%)

is.D regresses slightly in terms of absolute time but note that that
particular load varies quite a bit from run to run. The more relevant
observation is the total system CPU usage.

4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1
User 71471.91 70627.04
System 11078.96 8256.13
Elapsed 661.66 632.74

That is a substantial drop in system CPU usage and overall the workload
completes faster. The NUMA balancing statistics are also interesting

NUMA base PTE updates 111407972 139848884
NUMA huge PMD updates 206506 264869
NUMA page range updates 217139044 275461812
NUMA hint faults 4300924 3719784
NUMA hint local faults 3012539 3416618
NUMA hint local percent 70 91
NUMA pages migrated 1517487 1358420

While more PTEs are scanned due to changes in what faults are gathered,
it's clear that a far higher percentage of faults are local as the bulk
of the remote hits were dirty pages that, in this case with btrfs, had
no chance of migrating.

The following is a comparison when using XFS as that is a more realistic
filesystem choice for a data partition

4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1r47
Time cg.D 485.28 ( 0.00%) 442.62 ( 8.79%)
Time ep.D 77.68 ( 0.00%) 77.54 ( 0.18%)
Time is.D 26.44 ( 0.00%) 24.79 ( 6.24%)
Time lu.D 597.46 ( 0.00%) 597.11 ( 0.06%)
Time mg.D 142.65 ( 0.00%) 105.83 ( 25.81%)

That is a reasonable gain on two relatively long-lived workloads. While
not presented, there is also a substantial drop in system CPu usage and
the NUMA balancing stats show similar improvements in locality as btrfs
did.

Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7addf443 17-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

mm: add kernel_move_pages() helper, move compat syscall to mm/migrate.c

Move compat_sys_move_pages() to mm/migrate.c and make it call a newly
introduced helper -- kernel_move_pages() -- instead of the syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# ab5ac90a 31-Jan-2018 Michal Hocko <mhocko@suse.com>

mm, hugetlb: do not rely on overcommit limit during migration

hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.

1) it doesn't allow to migrate any huge page if the pool is used
completely which is not an exceptional case as the pool is static and
unused memory is just wasted.

2) it leads to a weird semantic when migration between two numa nodes
might increase the pool size of the destination NUMA node while the
page is in use. The issue is caused by per NUMA node surplus pages
tracking (see free_huge_page).

Address both issues by changing the way how we allocate and account
pages allocated for migration. Those should temporal by definition. So
we mark them that way (we will abuse page flags in the 3rd page) and
update free_huge_page to free such pages to the page allocator. Page
migration path then just transfers the temporal status from the new page
to the old one which will be freed on the last reference. The global
surplus count will never change during this path but we still have to be
careful when migrating a per-node suprlus page. This is now handled in
move_hugetlb_state which is called from the migration path and it copies
the hugetlb specific page state and fixes up the accounting when needed

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose. The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user visible effect of this patch is that migrated pages are really
temporal and they travel between NUMA nodes as per the migration
request:

Before migration
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

with the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f55e1014 29-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm, thp: Do not make pmd/pud dirty without a reason"

This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

It was a nice cleanup in theory, but as Nicolai Stange points out, we do
need to make the page dirty for the copy-on-write case even when we
didn't end up making it writable, since the dirty bit is what we use to
check that we've gone through a COW cycle.

Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 152e93af 26-Nov-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, thp: Do not make pmd/pud dirty without a reason

Currently we make page table entries dirty all the time regardless of
access type and don't even consider if the mapping is write-protected.
The reasoning is that we don't really need dirty tracking on THP and
making the entry dirty upfront may save some time on first write to the
page.

Unfortunately, such approach may result in false-positive
can_follow_write_pmd() for huge zero page or read-only shmem file.

Let's only make page dirty only if we about to write to the page anyway
(as we do for small pages).

I've restructured the code to make entry dirty inside
maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
write-protected.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4645b9fe 15-Nov-2017 Jérôme Glisse <jglisse@redhat.com>

mm/mmu_notifier: avoid call to invalidate_range() in range_end()

This is an optimization patch that only affect mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end

Existing pattern (before this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_end()
mmu_notifier_invalidate_range()

New pattern (after this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback after clearing the page table
under the page table lock and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.

Idea from Andrea Arcangeli

Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e20d103b 13-Oct-2017 Mark Hairgrove <mhairgrove@nvidia.com>

mm/migrate: fix indexing bug (off by one) and avoid out of bound access

Index was incremented before last use and thus the second array could
dereference to an invalid address (not mentioning the fact that it did
not properly clear the entry we intended to clear).

Link: http://lkml.kernel.org/r/1506973525-16491-1-git-send-email-jglisse@redhat.com
Fixes: 8315ada7f095bf ("mm/migrate: allow migrate_vma() to alloc new page on empty entry")
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b368cd4 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/hmm: avoid bloating arch that do not make use of HMM

This moves all new code including new page migration helper behind kernel
Kconfig option so that there is no codee bloat for arch or user that do
not want to use HMM or any of its associated features.

arm allyesconfig (without all the patchset, then with and this patch):
text data bss dec hex filename
83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
83722364 46511131 27582964 157816459 968168b vmlinux

[jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
[sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df6ad698 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/device-public-memory: device memory cache coherent with CPU

Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion. Add a new type of
ZONE_DEVICE to represent such memory. The use case are the same as for
the un-addressable device memory but without all the corners cases.

Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8315ada7 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: allow migrate_vma() to alloc new page on empty entry

This allows callers of migrate_vma() to allocate new page for empty CPU
page table entry (pte_none or back by zero page). This is only for
anonymous memory and it won't allow new page to be instanced if the
userfaultfd is armed.

This is useful to device driver that want to migrate a range of virtual
address and would rather allocate new memory than having to fault later
on.

Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5430dda 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: support un-addressable ZONE_DEVICE page in migration

Allow to unmap and restore special swap entry of un-addressable
ZONE_DEVICE memory.

Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c3328f1 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: migrate_vma() unmap page from vma while collecting pages

Common case for migration of virtual address range is page are map only
once inside the vma in which migration is taking place. Because we
already walk the CPU page table for that range we can directly do the
unmap there and setup special migration swap entry.

Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8763cb45 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: new memory migration helper for use with device memory

This patch add a new memory migration helpers, which migrate memory
backing a range of virtual address of a process to different memory (which
can be allocated through special allocator). It differs from numa
migration by working on a range of virtual address and thus by doing
migration in chunk that can be large enough to use DMA engine or special
copy offloading engine.

Expected users are any one with heterogeneous memory where different
memory have different characteristics (latency, bandwidth, ...). As an
example IBM platform with CAPI bus can make use of this feature to migrate
between regular memory and CAPI device memory. New CPU architecture with
a pool of high performance memory not manage as cache but presented as
regular memory (while being faster and with lower latency than DDR) will
also be prime user of this patch.

Migration to private device memory will be useful for device that have
large pool of such like GPU, NVidia plans to use HMM for that.

Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2916ecc0 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY

Introduce a new migration mode that allow to offload the copy to a device
DMA engine. This changes the workflow of migration and not all
address_space migratepage callback can support this.

This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).

No additional per-filesystem migratepage testing is needed. I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch). The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.

Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations. But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.

Usual flow is:

For each page {
1 - lock page
2 - call migratepage() callback
3 - (extra locking in some migratepage() callback)
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
5 - copy page
6 - (unlock any extra lock of migratepage() callback)
7 - return from migratepage() callback
8 - unlock page
}

The new mode MIGRATE_SYNC_NO_COPY:
1 - lock multiple pages
For each page {
2 - call migratepage() callback
3 - abort in all problematic migratepage() callback
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
} // finished all calls to migratepage() callback
5 - DMA copy multiple pages
6 - unlock all the pages

To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.

Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.

Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8db67eb 08-Sep-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: move_pages() supports thp migration

This patch enables thp migration for move_pages(2).

Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 616b8371 08-Sep-2017 Zi Yan <zi.yan@cs.rutgers.edu>

mm: thp: enable thp migration in generic path

Add thp migration's core code, including conversions between a PMD entry
and a swap entry, setting PMD migration entry, removing PMD migration
entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration. If you fail to
allocate a destination page as a thp, you just split the source thp as
we do now, and then enter the normal page migration. If you succeed to
allocate destination thp, you enter thp migration. Subsequent patches
actually enable thp migration for each caller of page migration by
allowing its get_new_page() callback to allocate thps.

[zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
[akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 197e7e52 20-Aug-2017 Linus Torvalds <torvalds@linux-foundation.org>

Sanitize 'move_pages()' permission checks

The 'move_paghes()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).

That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.

So change the access checks to the more common 'ptrace_may_access()'
model instead.

This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we hav ebetter
NUMA placement models these days), so I expect nobody to notice.

Famous last words.

Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9b80250 10-Aug-2017 Nadav Amit <namit@vmware.com>

Revert "mm: numa: defer TLB flush for THP migration as long as possible"

While deferring TLB flushes is a good practice, the reverted patch
caused pending TLB flushes to be checked while the page-table lock is
not taken. As a result, in architectures with weak memory model (PPC),
Linux may miss a memory-barrier, miss the fact TLB flushes are pending,
and cause (in theory) a memory corruption.

Since the alternative of using smp_mb__after_unlock_lock() was
considered a bit open-coded, and the performance impact is expected to
be small, the previous patch is reverted.

This reverts b0943d61b8fa ("mm: numa: defer TLB flush for THP migration
as long as possible").

Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8b1b436d 07-Jun-2017 Peter Zijlstra <peterz@infradead.org>

mm, locking: Rework {set,clear,mm}_tlb_flush_pending()

Commit:

af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")

added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
can solve the same problem without this barrier.

If instead we mandate that mm_tlb_flush_pending() is used while
holding the PTL we're guaranteed to observe prior
set_tlb_flush_pending() instances.

For this to work we need to rework migrate_misplaced_transhuge_page()
a little and move the test up into do_huge_pmd_numa_page().

NOTE: this relies on flush_tlb_range() to guarantee:

(1) it ensures that prior page table updates are visible to the
page table walker and
(2) it ensures that subsequent memory accesses are only made
visible after the invalidation has completed

This is required for architectures that implement TRANSPARENT_HUGEPAGE
(arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
mm_tlb_flush_pending() in their page-table operations (arm, arm64,
x86).

This appears true for:

- arm (DSB ISB before and after),
- arm64 (DSB ISHST before, and DSB ISH after),
- powerpc (PTESYNC before and after),
- s390 and x86 TLB invalidate are serializing instructions

But I failed to understand the situation for:

- arc, mips, sparc

Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
inside the PTL. It still needs to guarantee the PTL unlock happens
_after_ the invalidate completes.

Vineet, Ralf and Dave could you guys please have a look?

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f4e177d1 10-Jul-2017 Will Deacon <will@kernel.org>

mm/migrate.c: stabilise page count when migrating transparent hugepages

When migrating a transparent hugepage, migrate_misplaced_transhuge_page
guards itself against a concurrent fastgup of the page by checking that
the page count is equal to 2 before and after installing the new pmd.

If the page count changes, then the pmd is reverted back to the original
entry, however there is a small window where the new (possibly writable)
pmd is installed and the underlying page could be written by userspace.
Restoring the old pmd could therefore result in loss of data.

This patch fixes the problem by freezing the page count whilst updating
the page tables, which protects against a concurrent fastgup without the
need to restore the old pmd in the failure case (since the page count
can no longer change under our feet).

Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3114a84 10-Jul-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm: hugetlb: soft-offline: dissolve source hugepage after successful migration

Currently hugepage migrated by soft-offline (i.e. due to correctable
memory errors) is contained as a hugepage, which means many non-error
pages in it are unreusable, i.e. wasted.

This patch solves this issue by dissolving source hugepages into buddy.
As done in previous patch, PageHWPoison is set only on a head page of
the error hugepage. Then in dissoliving we move the PageHWPoison flag
to the raw error page so that all healthy subpages return back to buddy.

[arnd@arndb.de: fix warnings: replace some macros with inline functions]
Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 383321ab 06-Jul-2017 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/hugetlb/migration: use set_huge_pte_at instead of set_pte_at

Patch series "HugeTLB migration support for PPC64", v2.

This patch (of 9):

The right interface to use to set a hugetlb pte entry is set_huge_pte_at.
Use that instead of set_pte_at.

Link: http://lkml.kernel.org/r/1494926612-23928-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4b82222 03-May-2017 Minchan Kim <minchan@kernel.org>

mm: make rmap_one boolean function

rmap_one's return value controls whether rmap_work should contine to
scan other ptes or not so it's target for changing to boolean. Return
true if the scan should be continued. Otherwise, return false to stop
the scanning.

This patch makes rmap_one's return value to boolean.

Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d44d363f 03-May-2017 Shaohua Li <shli@fb.com>

mm: don't assume anonymous pages have SwapBacked flag

There are a few places the code assumes anonymous pages should have
SwapBacked flag set. MADV_FREE pages are anonymous pages but we are
going to add them to LRU_INACTIVE_FILE list and clear SwapBacked flag
for them. The assumption doesn't hold any more, so fix them.

Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15038d0d 03-May-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: remove unnecessary reclaimability check from NUMA balancing target

NUMA balancing already checks the watermarks of the target node to
decide whether it's a suitable balancing target. Whether the node is
reclaimable or not is irrelevant when we don't intend to reclaim.

Link: http://lkml.kernel.org/r/20170228214007.5621-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc280fe8 20-Apr-2017 Rabin Vincent <rabinv@axis.com>

mm: prevent NR_ISOLATE_* stats from going negative

Commit 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn
based migration") moved the dec_node_page_state() call (along with the
page_is_file_cache() call) to after putback_lru_page().

But page_is_file_cache() can change after putback_lru_page() is called,
so it should be called before putback_lru_page(), as it was before that
patch, to prevent NR_ISOLATE_* stats from going negative.

Without this fix, non-CONFIG_SMP kernels end up hanging in the
while(too_many_isolated()) { congestion_wait() } loop in
shrink_active_list() due to the negative stats.

Mem-Info:
active_anon:32567 inactive_anon:121 isolated_anon:1
active_file:6066 inactive_file:6639 isolated_file:4294967295
^^^^^^^^^^
unevictable:0 dirty:115 writeback:0 unstable:0
slab_reclaimable:2086 slab_unreclaimable:3167
mapped:3398 shmem:18366 pagetables:1145 bounce:0
free:1798 free_pcp:13 free_cma:0

Fixes: 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn based migration")
Link: http://lkml.kernel.org/r/1492683865-27549-1-git-send-email-rabin.vincent@axis.com
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ming Ling <ming.ling@spreadtrum.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b0ece6f 31-Mar-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: fix remove_migration_pte() for ksm pages

I found that calling page migration for ksm pages causes the following
bug:

page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913
flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked)
raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001
raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff88007d81b800
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/rmap.c:1086!
invalid opcode: 0000 [#1] SMP
Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260
RSP: 0018:ffffc90002473b30 EFLAGS: 00010282
RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0
RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1
R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80
R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000
FS: 00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0
Call Trace:
page_add_anon_rmap+0x18/0x20
remove_migration_pte+0x220/0x2c0
rmap_walk_ksm+0x143/0x220
rmap_walk+0x55/0x60
remove_migration_ptes+0x53/0x80
migrate_pages+0x8ed/0xb60
soft_offline_page+0x309/0x8d0
store_soft_offline_page+0xaf/0xf0
dev_attr_store+0x18/0x30
sysfs_kf_write+0x3a/0x50
kernfs_fop_write+0xff/0x180
__vfs_write+0x37/0x160
vfs_write+0xb2/0x1b0
SyS_write+0x55/0xc0
do_syscall_64+0x67/0x180
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f51956339e0
RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0
RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001
RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740
R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400
R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff <0f> 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48
RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30
---[ end trace a679d00f4af2df48 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

The problem is in the following lines:

new = page - pvmw.page->index +
linear_page_index(vma, pvmw.address);

The 'new' is calculated with 'page' which is given by the caller as a
destination page and some offset adjustment for thp. But this doesn't
properly work for ksm pages because pvmw.page->index doesn't change for
each address but linear_page_index() changes, which means that 'new'
points to different pages for each addresses backed by the ksm page. As
a result, we try to set totally unrelated pages as destination pages,
and that causes kernel crash.

This patch fixes the miscalculation and makes ksm page migration work
fine.

Fixes: 3fe87967c536 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()")
Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e84f315 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>

We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3fe87967 24-Feb-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: convert remove_migration_pte() to use page_vma_mapped_walk()

remove_migration_pte() also can easily be converted to
page_vma_mapped_walk().

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170129173858.45174-13-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9e5bcd61 24-Feb-2017 Yisheng Xie <xieyisheng1@huawei.com>

mm/migration: make isolate_movable_page() return int type

Patch series "HWPOISON: soft offlining for non-lru movable page", v6.

After Minchan's commit bda807d44454 ("mm: migrate: support non-lru
movable page migration"), some type of non-lru page like zsmalloc and
virtio-balloon page also support migration.

Therefore, we can:

1) soft offlining no-lru movable pages, which means when memory
corrected errors occur on a non-lru movable page, we can stop to use
it by migrating data onto another page and disable the original
(maybe half-broken) one.

2) enable memory hotplug for non-lru movable pages, i.e. we may offline
blocks, which include such pages, by using non-lru page migration.

This patchset is heavily dependent on non-lru movable page migration.

This patch (of 4):

Change the return type of isolate_movable_page() from bool to int. It
will return 0 when isolate movable page successfully, and return -EBUSY
when it isolates failed.

There is no functional change within this patch but prepare for later
patch.

[xieyisheng1@huawei.com: v6]
Link: http://lkml.kernel.org/r/1486108770-630-2-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1485867981-16037-2-git-send-email-ysxie@foxmail.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6326fec1 24-Dec-2016 Nicholas Piggin <npiggin@gmail.com>

mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked

A page is not added to the swap cache without being swap backed,
so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d75f366 12-Dec-2016 Johannes Weiner <hannes@cmpxchg.org>

lib: radix-tree: check accounting of existing slot replacement users

The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well. We need checks.

Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace(). This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.

Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6afcf8ef 12-Dec-2016 Ming Ling <ming.ling@spreadtrum.com>

mm, compaction: fix NR_ISOLATED_* stats for pfn based migration

Since commit bda807d44454 ("mm: migrate: support non-lru movable page
migration") isolate_migratepages_block) can isolate !PageLRU pages which
would acct_isolated account as NR_ISOLATED_*. Accounting these non-lru
pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide
heuristics based on those counters such as pgdat_reclaimable_pages resp.
too_many_isolated which would lead to unexpected stalls during the
direct reclaim without any good reason. Note that
__alloc_contig_migrate_range can isolate a lot of pages at once.

On mobile devices such as 512M ram android Phone, it may use a big zram
swap. In some cases zram(zsmalloc) uses too many non-lru but
migratedable pages, such as:

MemTotal: 468148 kB
Normal free:5620kB
Free swap:4736kB
Total swap:409596kB
ZRAM: 164616kB(zsmalloc non-lru pages)
active_anon:60700kB
inactive_anon:60744kB
active_file:34420kB
inactive_file:37532kB

Fix this by only accounting lru pages to NR_ISOLATED_* in
isolate_migratepages_block right after they were isolated and we still
know they were on LRU. Drop acct_isolated because it is called after
the fact and we've lost that information. Batching per-cpu counter
doesn't make much improvement anyway. Also make sure that we uncharge
only LRU pages when putting them back on the LRU in
putback_movable_pages resp. when unmap_and_move migrates the page.

[mhocko@suse.com: replace acct_isolated() with direct counting]
Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d2329f8 07-Oct-2016 Andrea Arcangeli <aarcange@redhat.com>

mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE

vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
concurrently and this prevents the risk of reading intermediate values.

Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25160354 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.

We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

The patch effectively changes the current GFP_TRANSHUGE users as
follows:

* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.

As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).

[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a1c84b4 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: remove reclaim and compaction retry approximations

If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone
and node double accounting by approximating retries" with the difference
that inactive/active stats are still available. This preserves the
history of why the approximation was retried and why it had to be
reverted to handle OOM kills on 32-bit systems.

Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bca67592 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmstat: remove zone and node double accounting by approximating retries

The number of LRU pages, dirty pages and writeback pages must be
accounted for on both zones and nodes because of the reclaim retry
logic, compaction retry logic and highmem calculations all depending on
per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
exception is costly high-order allocations or allocations that cannot
fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then it would fall through to __alloc_pages_direct_compact.

This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
but without per-zone stats there are not many alternatives. The impact
it that zone-constrained allocations may delay before considering the
OOM killer.

As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.

In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation. While
it is possible this will make the wrong decision on occasion, it will
not infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem. This patch uses a global counter used instead of per-zone
stats as it is sufficient.

In combination, this allows the per-zone LRU and dirty state counters to
be removed.

[mgorman@techsingularity.net: fix acct_highmem_file_pages()]
Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested by: Michal Hocko <mhocko@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11fb9989 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: move most file-based accounting to the node

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b9d0fab 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: rename NR_ANON_PAGES to NR_ANON_MAPPED

NR_FILE_PAGES is the number of file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES is the number of file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 599d0c95 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: move LRU lists to node

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 10102459 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: introduce do_set_pmd()

With postponed page table allocation we have chance to setup huge pages.
do_set_pte() calls do_set_pmd() if following criteria met:

- page is compound;
- pmd entry in pmd_none();
- vma has suitable size and alignment;

Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd78fedd 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

rmap: support file thp

Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k page. That's not efficient, but it's not obvious
how we can optimize this. We can look into optimization later.

PG_double_map optimization doesn't work for file pages since lifecycle
of file pages is different comparing to anon pages: file page can be
mapped again at any time.

Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b1123ea6 26-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: balloon: use general non-lru movable page feature

Now, VM has a feature to migrate non-lru movable pages so balloon
doesn't need custom migration hooks in migrate.c and compaction.c.

Instead, this patch implements the page->mapping->a_ops->
{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general migration
functions and make balloon compaction simple.

[akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bda807d4 26-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: migrate: support non-lru movable page migration

We have allowed migration for only LRU pages until now and it was enough
to make high-order pages. But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,. enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable. For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation. If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page. The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0. If driver cannot migrate the page at the
moment, driver can return -EAGAIN. On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure". On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page. In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM. Exactly speaking, PG_movable is
not a real flag of struct page. Rather than, VM reuses page->mapping's
lower bits to represent it.

#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly. Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page. As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable). But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated. Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page. So if a CPU encounters PG_isolated
non-lru movable page, it can skip it. Driver doesn't need to manipulate
the flag because VM will set/clear it automatically. Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field. PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6c919eb 26-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: use put_page() to free page instead of putback_lru_page()

Recently, I got many reports about perfermance degradation in embedded
system(Android mobile phone, webOS TV and so on) and easy fork fail.

The problem was fragmentation caused by zram and GPU driver mainly.
With memory pressure, their pages were spread out all of pageblock and
it cannot be migrated with current compaction algorithm which supports
only LRU pages. In the end, compaction cannot work well so reclaimer
shrinks all of working set pages. It made system very slow and even to
fail to fork easily which requires order-[2 or 3] allocations.

Other pain point is that they cannot use CMA memory space so when OOM
kill happens, I can see many free pages in CMA area, which is not memory
efficient. In our product which has big CMA memory, it reclaims zones
too exccessively to allocate GPU and zram page although there are lots
of free space in CMA so system becomes very slow easily.

To solve these problem, this patch tries to add facility to migrate
non-lru pages via introducing new functions and page flags to help
migration.

struct address_space_operations {
..
..
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
..
}

new page flags

PG_movable
PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".

Originally, Gioh Kim had tried to support this feature but he moved so I
took over the work. I took many code from his work and changed a little
bit and Konstantin Khlebnikov helped Gioh a lot so he should deserve to
have many credit, too.

And I should mention Chulmin who have tested this patchset heavily so I
can find many bugs from him. :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
zsmalloc: keep max_object in size_class
zsmalloc: use bit_spin_lock
zsmalloc: use accessor
zsmalloc: factor page chain functionality out
zsmalloc: introduce zspage structure
zsmalloc: separate free_zspage from putback_zspage
zsmalloc: use freeobj for index

5. zsmalloc page migration
zsmalloc: page migration support
zram: use __GFP_MOVABLE for memory allocation

This patch (of 12):

Procedure of page migration is as follows:

First of all, it should isolate a page from LRU and try to migrate the
page. If it is successful, it releases the page for freeing.
Otherwise, it should put the page back to LRU list.

For LRU pages, we have used putback_lru_page for both freeing and
putback to LRU list. It's okay because put_page is aware of LRU list so
if it releases last refcount of the page, it removes the page from LRU
list. However, It makes unnecessary operations (e.g., lru_cache_add,
pagevec and flags operations. It would be not significant but no worth
to do) and harder to support new non-lru page migration because put_page
isn't aware of non-lru page's data structure.

To solve the problem, we can add new hook in put_page with PageMovable
flags check but it can increase overhead in hot path and needs new
locking scheme to stabilize the flag check with put_page.

So, this patch cleans it up to divide two semantic(ie, put and putback).
If migration is successful, use put_page instead of putback_lru_page and
use putback_lru_page only on failure. That makes code more readable and
doesn't add overhead in put_page.

Comment from Vlastimil
"Yeah, and compaction (perhaps also other migration users) has to drain
the lru pvec... Getting rid of this stuff is worth even by itself."

Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1118dce7 16-Jun-2016 Richard Weinberger <richard@nod.at>

mm: Export migrate_page_move_mapping and migrate_page_copy

Export these symbols such that UBIFS can implement
->migratepage.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Christoph Hellwig <hch@lst.de>


# dfef2ef4 20-May-2016 David Rientjes <rientjes@google.com>

mm, migrate: increment fail count on ENOMEM

If page migration fails due to -ENOMEM, nr_failed should still be
incremented for proper statistics.

This was encountered recently when all page migration vmstats showed 0,
and inferred that migrate_pages() was never called, although in reality
the first page migration failed because compaction_alloc() failed to
find a migration target.

This patch increments nr_failed so the vmstat is properly accounted on
ENOMEM.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa9949da 19-May-2016 Hugh Dickins <hughd@google.com>

mm: use __SetPageSwapBacked and dont ClearPageSwapBacked

v3.16 commit 07a427884348 ("mm: shmem: avoid atomic operation during
shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
by __SetPageSwapBacked, pointing out that the newly allocated page is
not yet visible to other users (except speculative get_page_unless_zero-
ers, who may not update page flags before their further checks).

That was part of a series in which Mel was focused on tmpfs profiles:
but almost all SetPageSwapBacked uses can be so optimized, with the same
justification.

Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
it's not an error to free a page with PG_swapbacked set.

Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
doing it differently in different places; but that's for tidiness - if
the ordering actually mattered, we should not be using the __variants.

There's probably scope for further __SetPageFlags in other places, but
SwapBacked is the one I'm interested in at the moment.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7e69488 28-Apr-2016 Minchan Kim <minchan@kernel.org>

mm/hwpoison: fix wrong num_poisoned_pages accounting

Currently, migration code increses num_poisoned_pages on *failed*
migration page as well as successfully migrated one at the trial of
memory-failure. It will make the stat wrong. As well, it marks the
page as PG_HWPoison even if the migration trial failed. It would mean
we cannot recover the corrupted page using memory-failure facility.

This patches fixes it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e388466d 17-Mar-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: make remove_migration_ptes() beyond mm/migration.c

Make remove_migration_ptes() available to be used in split_huge_page().

New parameter 'locked' added: as with try_to_umap() we need a way to
indicate that caller holds rmap lock.

We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP
pages are never mlocked.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe896d18 17-Mar-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm: introduce page reference manipulation functions

The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.

In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.

In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 458aa76d 17-Mar-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range

We remove one instace of flush_tlb_range here. That was added by commit
f714f4f20e59 ("mm: numa: call MMU notifiers on THP migration"). But the
pmdp_huge_clear_flush_notify should have done the require flush for us.
Hence remove the extra flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74485cf2 15-Mar-2016 Johannes Weiner <hannes@cmpxchg.org>

mm: migrate: consolidate mem_cgroup_migrate() calls

Rather than scattering mem_cgroup_migrate() calls all over the place,
have a single call from a safe place where every migration operation
eventually ends up in - migrate_page_copy().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a93ca8f 15-Mar-2016 Johannes Weiner <hannes@cmpxchg.org>

mm: migrate: do not touch page->mem_cgroup of live pages

Changing a page's memcg association complicates dealing with the page,
so we want to limit this as much as possible. Page migration e.g. does
not have to do that. Just like page cache replacement, it can forcibly
charge a replacement page, and then uncharge the old page when it gets
freed. Temporarily overcharging the cgroup by a single page is not an
issue in practice, and charging is so cheap nowadays that this is much
preferrable to the headache of messing with live pages.

The only place that still changes the page->mem_cgroup binding of live
pages is when pages move along with a task to another cgroup. But that
path isolates the page from the LRU, takes the page lock, and the move
lock (lock_page_memcg()). That means page->mem_cgroup is always stable
in callers that have the page isolated from the LRU or locked. Lighter
unlocked paths, like writeback accounting, can use lock_page_memcg().

[akpm@linux-foundation.org: fix build]
[vdavydov@virtuozzo.com: fix lockdep splat]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cd12b4a 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_owner: track and print last migrate reason

During migration, page_owner info is now copied with the rest of the
page, so the stacktrace leading to free page allocation during migration
is overwritten. For debugging purposes, it might be however useful to
know that the page has been migrated since its initial allocation. This
might happen many times during the lifetime for different reasons and
fully tracking this, especially with stacktraces would incur extra
memory costs. As a compromise, store and print the migrate_reason of
the last migration that occurred to the page. This is enough to
distinguish compaction, numa balancing etc.

Example page_owner entry after the patch:

Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b6325>] alloc_pages_vma+0xb5/0x250
[<ffffffff81177491>] shmem_alloc_page+0x61/0x90
[<ffffffff8117a438>] shmem_getpage_gfp+0x678/0x960
[<ffffffff8117c2b9>] shmem_fallocate+0x329/0x440
[<ffffffff811de600>] vfs_fallocate+0x140/0x230
[<ffffffff811df434>] SyS_fallocate+0x44/0x70
[<ffffffff8158cc2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Page has been migrated, last migrate reason: compaction

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d435edca 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_owner: copy page owner info during migration

The page_owner mechanism stores gfp_flags of an allocation and stack
trace that lead to it. During page migration, the original information
is practically replaced by the allocation of free page as the migration
target. Arguably this is less useful and might lead to all the
page_owner info for migratable pages gradually converge towards
compaction or numa balancing migrations. It has also lead to
inaccuracies such as one fixed by commit e2cfc91120fa ("mm/page_owner:
set correct gfp_mask on page_owner").

This patch thus introduces copying the page_owner info during migration.
However, since the fact that the page has been migrated from its
original place might be useful for debugging, the next patch will
introduce a way to track that information as well.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8479eba7 26-Feb-2016 Mel Gorman <mgorman@techsingularity.net>

mm: numa: quickly fail allocations for NUMA balancing on full nodes

Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
flag combination due to confusing semantics. It noted that
alloc_misplaced_dst_page() was one such user after changes made by
commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").

Unfortunately when GFP_THISNODE was removed, users of
alloc_misplaced_dst_page() started waking kswapd and entering direct
reclaim because the wrong GFP flags are cleared. The consequence is
that workloads that used to fit into memory now get reclaimed which is
addressed by this patch.

The problem can be demonstrated with "mutilate" that exercises memcached
which is software dedicated to memory object caching. The configuration
uses 80% of memory and is run 3 times for varying numbers of clients.
The results on a 4-socket NUMA box are

mutilate
4.4.0 4.4.0
vanilla numaswap-v1
Hmean 1 8394.71 ( 0.00%) 8395.32 ( 0.01%)
Hmean 4 30024.62 ( 0.00%) 34513.54 ( 14.95%)
Hmean 7 32821.08 ( 0.00%) 70542.96 (114.93%)
Hmean 12 55229.67 ( 0.00%) 93866.34 ( 69.96%)
Hmean 21 39438.96 ( 0.00%) 85749.21 (117.42%)
Hmean 30 37796.10 ( 0.00%) 50231.49 ( 32.90%)
Hmean 47 18070.91 ( 0.00%) 38530.13 (113.22%)

The metric is queries/second with the more the better. The results are
way outside of the noise and the reason for the improvement is obvious
from some of the vmstats

4.4.0 4.4.0
vanillanumaswap-v1r1
Minor Faults 1929399272 2146148218
Major Faults 19746529 3567
Swap Ins 57307366 9913
Swap Outs 50623229 17094
Allocation stalls 35909 443
DMA allocs 0 0
DMA32 allocs 72976349 170567396
Normal allocs 5306640898 5310651252
Movable allocs 0 0
Direct pages scanned 404130893 799577
Kswapd pages scanned 160230174 0
Kswapd pages reclaimed 55928786 0
Direct pages reclaimed 1843936 41921
Page writes file 2391 0
Page writes anon 50623229 17094

The vanilla kernel is swapping like crazy with large amounts of direct
reclaim and kswapd activity. The figures are aggregate but it's known
that the bad activity is throughout the entire test.

Note that simple streaming anon/file memory consumers also see this
problem but it's not as obvious. In those cases, kswapd is awake when
it should not be.

As there are at least two reclaim-related bugs out there, it's worth
spelling out the user-visible impact. This patch only addresses bugs
related to excessive reclaim on NUMA hardware when the working set is
larger than a NUMA node. There is a bug related to high kswapd CPU
usage but the reports are against laptops and other UMA hardware and is
not addressed by this patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a982250 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: introduce deferred_split_huge_page()

Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.

Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.

It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure via shrinker interface. The page will be dropped from
list on freeing through compound page destructor.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d2fa965 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp, mm: split_huge_page(): caller need to lock page

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setup and remove migration entries require
page to be locked.

Some of split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53f9263b 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: rework mapcount accounting to enable 4k mapping of THPs

We're going to allow mapping of individual 4k pages of THP compound. It
means we need to track mapcount on per small page basis.

Straight-forward approach is to use ->_mapcount in all subpages to track
how many time this subpage is mapped with PMDs or PTEs combined. But
this is rather expensive: mapping or unmapping of a THP page with PMD
would require HPAGE_PMD_NR atomic operations instead of single we have
now.

The idea is to store separately how many times the page was mapped as
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track PTE mapcount.

We use the same approach as with compound page destructor and compound
order to store compound_mapcount: use space in first tail page,
->mapping this time.

Any time we map/unmap whole compound page (THP or hugetlb) -- we
increment/decrement compound_mapcount. When we map part of compound
page with PTE we operate on ->_mapcount of the subpage.

page_mapcount() counts both: PTE and PMD mappings of the page.

Basically, we have mapcount for a subpage spread over two counters. It
makes tricky to detect when last mapcount for a page goes away.

We introduced PageDoubleMap() for this. When we split THP PMD for the
first time and there's other PMD mapping left we offset up ->_mapcount
in all subpages by one and set PG_double_map on the compound page.
These additional references go away with last compound_mapcount.

This approach provides a way to detect when last mapcount goes away on
per small page basis without introducing new overhead for most common
cases.

[akpm@linux-foundation.org: fix typo in comment]
[mhocko@suse.com: ignore partial THP when moving task]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d281ee61 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

rmap: add argument to charge compound page

We're going to allow mapping of individual 4k pages of THP compound
page. It means we cannot rely on PageTransHuge() check to decide if
map/unmap small page or THP.

The patch adds new argument to rmap functions to indicate whether we
want to operate on whole compound page or only the small page.

[n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48c935ad 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

page-flags: define PG_locked behavior on compound pages

lock_page() must operate on the whole compound page. It doesn't make
much sense to lock part of compound page. Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked(). They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never
appear there. VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71baba4b 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM

__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0164adc 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42cb14b1 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: migrate dirty page without clear_page_dirty_for_io etc

clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.

No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.

It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).

Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).

But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf4b769a 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: page migration avoid touching newpage until no going back

We have had trouble in the past from the way in which page migration's
newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
fix direct reclaim writeback regression") which proposed a cleanup.

We have no actual problem now, but I think the procedure would be clearer
(and alternative get_new_page pools safer to implement) if we assert that
newpage is not touched until we are sure that it's going to be used -
except for taking the trylock on it in __unmap_and_move().

So shift the early initializations from move_to_new_page() into
migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

Adjust stages 3 to 8 in the Documentation file accordingly.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03f15c86 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: simplify page migration's anon_vma comment and flow

__unmap_and_move() contains a long stale comment on page_get_anon_vma()
and PageSwapCache(), with an odd control flow that's hard to follow.
Mostly this reflects our confusion about the lifetime of an anon_vma, in
the early days of page migration, before we could take a reference to one.
Nowadays this seems quite straightforward: cut it all down to essentials.

I cannot see the relevance of swapcache here at all, so don't treat it any
differently: I believe the old comment reflects in part our anon_vma
confusions, and in part the original v2.6.16 page migration technique,
which used actual swap to migrate anon instead of swap-like migration
entries. Why should a swapcache page not be migrated with the aid of
migration entry ptes like everything else? So lose that comment now, and
enable migration entries for swapcache in the next patch.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c3f9a67 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: page migration remove_migration_ptes at lock+unlock level

Clean up page migration a little more by calling remove_migration_ptes()
from the same level, on success or on failure, from __unmap_and_move() or
from unmap_and_move_huge_page().

Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
leave that to when the page is freed. Except for here in page migration,
it has been an invariant that a PageAnon (bit set in page->mapping) page
stays PageAnon until it is freed, and I think we're safer to keep to that.

And with the above rearrangement, it's necessary because zap_pte_range()
wants to identify whether a migration entry represents a file or an anon
page, to update the appropriate rss stats without waiting on it.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7db7671f 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: page migration trylock newpage at same level as oldpage

Clean up page migration a little by moving the trylock of newpage from
move_to_new_page() into __unmap_and_move(), where the old page has been
locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
accordingly.

But make one kind-of-functional change on the way: whereas trylock of
newpage used to BUG() if it failed, now simply return -EAGAIN if so.
Cutting out BUG()s is good, right? But, to be honest, this is really to
extend the usefulness of the custom put_new_page feature, allowing a pool
of new pages to be shared perhaps with racing uses.

Use an "else" instead of that "skip_unmap" label.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2def7424 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: page migration use the put_new_page whenever necessary

I don't know of any problem from the way it's used in our current tree,
but there is one defect in page migration's custom put_new_page feature.

An unused newpage is expected to be released with the put_new_page(), but
there was one MIGRATEPAGE_SUCCESS (0) path which released it with
putback_lru_page(): which can be very wrong for a custom pool.

Fixed more easily by resetting put_new_page once it won't be needed, than
by adding a further flag to modify the rc test.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 14e0f9bc 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: correct a couple of page migration comments

It's migrate.c not migration,c, and nowadays putback_movable_pages() not
putback_lru_pages().

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 45637bab 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: rename mem_cgroup_migrate to mem_cgroup_replace_page

After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
mem_cgroup_migrate() doesn't have much to offer in page migration: convert
migrate_misplaced_transhuge_page() to set_page_memcg() instead.

Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
remaining callers are replace_page_cache_page() and shmem_replace_page():
both of whom passed lrucare true, so just eliminate that argument.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51afb12b 05-Nov-2015 Hugh Dickins <hughd@google.com>

mm: page migration fix PageMlocked on migrated pages

Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
unmaps the page from userspace before migrating, and that commit clears
PageMlocked on the final unmap, leaving mlock_migrate_page() with
nothing to do. Not a serious bug, the next attempt at reclaiming the
page would fix it up; but a betrayal of page migration's intent - the
new page ought to emerge as PageMlocked.

I don't see how to fix it for mlock_migrate_page() itself; but easily
fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
is VM_LOCKED - under pte lock as in try_to_unmap_one().

Delete mlock_migrate_page()? Not quite, it does still serve a purpose for
migrate_misplaced_transhuge_page(): where we could replace it by a test,
clear_page_mlock(), mlock_vma_page() sequence; but would that be an
improvement? mlock_migrate_page() is fairly lean, and let's make it
leaner by skipping the irq save/restore now clearly not needed.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2f81fb2 05-Nov-2015 Vlastimil Babka <vbabka@suse.cz>

mm, migrate: count pages failing all retries in vmstat and tracepoint

Migration tries up to 10 times to migrate pages that return -EAGAIN until
it gives up. If some pages fail all retries, they are counted towards the
number of failed pages that migrate_pages() returns. They should also be
counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages
tracepoint.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0610c25d 01-Oct-2015 Greg Thelen <gthelen@google.com>

memcg: fix dirty page migration

The problem starts with a file backed dirty page which is charged to a
memcg. Then page migration is used to move oldpage to newpage.

Migration:
- copies the oldpage's data to newpage
- clears oldpage.PG_dirty
- sets newpage.PG_dirty
- uncharges oldpage from memcg
- charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.

However, because newpage is not yet charged, setting newpage.PG_dirty
does not increment the memcg's dirty page count. After migration
completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned(). At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration. This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.

This issue:
- can harm performance of non root memcg buffered writes.
- can report too small (even negative) values in
memory.stat[(total_)dirty] counters of all memcg, including the root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.

Test:
0) setup and enter limited memcg
mkdir /sys/fs/cgroup/test
echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/test/cgroup.procs

1) buffered writes baseline
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

2) buffered writes with compaction antagonist to induce migration
yes 1 > /proc/sys/vm/compact_memory &
rm -rf /data/tmp/foo
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
kill %
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

3) buffered writes without antagonist, should match baseline
rm -rf /data/tmp/foo
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

(speed, dirty residue)
unpatched patched
1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages
2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages
3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages

Notice that unpatched baseline performance (1) fell after
migration (3): 841 -> 114 MB/s. In the patched kernel, post
migration performance matches baseline.

Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3aaa76e1 22-Sep-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: hugetlb: putback destination hugepage to active list

Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
each hugetlb page maintains its active flag to avoid a race condition
betwe= en multiple calls of isolate_huge_page(), but current kernel
doesn't set the f= lag on a hugepage allocated by migration because the
proper putback routine isn= 't called. This means that users could
still encounter the race referred to by bcc54222309c in this special
case, so this patch fixes it.

Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [4.1.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33c3fc71 09-Sep-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: introduce idle page tracking

Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96db800f 08-Sep-2015 Vlastimil Babka <vbabka@suse.cz>

mm: rename alloc_pages_exact_node() to __alloc_pages_node()

alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.

The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").

Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.

Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.

Both differences will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da1b13cc 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix race between soft_offline_page and unpoison_memory

Wanpeng Li reported a race between soft_offline_page() and
unpoison_memory(), which causes the following kernel panic:

BUG: Bad page state in process bash pfn:97000
page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
flags: 0x1fffff80080048(uptodate|active|swapbacked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x40(active)
Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
Call Trace:
dump_stack+0x48/0x5c
bad_page+0xe6/0x140
free_pages_prepare+0x2f9/0x320
? uncharge_list+0xdd/0x100
free_hot_cold_page+0x40/0x170
__put_single_page+0x20/0x30
put_page+0x25/0x40
unmap_and_move+0x1a6/0x1f0
migrate_pages+0x100/0x1d0
? kill_procs+0x100/0x100
? unlock_page+0x6f/0x90
__soft_offline_page+0x127/0x2a0
soft_offline_page+0xa6/0x200

This race is explained like below:

CPU0 CPU1

soft_offline_page
__soft_offline_page
TestSetPageHWPoison
unpoison_memory
PageHWPoison check (true)
TestClearPageHWPoison
put_page -> release refcount held by get_hwpoison_page in unpoison_memory
put_page -> release refcount held by isolate_lru_page in __soft_offline_page
migrate_pages

The second put_page() releases refcount held by isolate_lru_page() which
will lead to unmap_and_move() releases the last refcount of page and w/
mapcount still 1 since try_to_unmap() is not called if there is only one
user map the page. Anyway, the page refcount and mapcount will still
mess if the page is mapped by multiple users.

This race was introduced by commit 4491f71260 ("mm/memory-failure: set
PageHWPoison before migrate_pages()"), which focuses on preventing the
reuse of successfully migrated page. Before this commit we prevent the
reuse by changing the migratetype to MIGRATE_ISOLATE during soft
offlining, which has the following problems, so simply reverting the
commit is not a best option:

1) it doesn't eliminate the reuse completely, because
set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
target page if the pageblock of the page contains one or more
unmovable pages (i.e. has_unmovable_pages() returns true).

2) the original code changes migratetype to MIGRATE_ISOLATE
forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
regardless of the original migratetype state, which could impact
other subsystems like memory hotplug or compaction.

This patch moves PageSetHWPoison just after put_page() in
unmap_and_move(), which closes up the reported race window and minimizes
another race window b/w SetPageHWPoison and reallocation (which causes
the reuse of soft-offlined page.) The latter race window still exists
but it's acceptable, because it's rare and effectively the same as
ordinary "containment failure" case even if it happens, so keep the
window open is acceptable.

Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d899844e 04-Sep-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: fix status code which move_pages() returns for zero page

The manpage for move_pages(2) specifies that status code for zero page is
supposed to be -EFAULT. Currently kernel return -ENOENT in this case.

follow_page() can do it for us, if we would ask for FOLL_DUMP. The use of
FOLL_DUMP also means that the upper layer page tables pages are no longer
allocated.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4491f712 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: set PageHWPoison before migrate_pages()

Now page freeing code doesn't consider PageHWPoison as a bad page, so by
setting it before completing the page containment, we can prevent the
error page from being reused just after successful page migration.

I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
page table entry is transformed into migration entry, not to hwpoison
entry.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4c18e6f 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*

The race condition addressed in commit add05cecef80 ("mm: soft-offline:
don't free target page in successful page migration") was not closed
completely, because that can happen not only for soft-offline, but also
for hard-offline. Consider that a slab page is about to be freed into
buddy pool, and then an uncorrected memory error hits the page just
after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
necessary because the data on the affected page is not consumed.

To solve it, this patch drops __PG_HWPOISON from page flag checks at
allocation/free time. I think it's justified because __PG_HWPOISON
flags is defined to prevent the page from being reused, and setting it
outside the page's alloc-free cycle is a designed behavior (not a bug.)

For recent months, I was annoyed about BUG_ON when soft-offlined page
remains on lru cache list for a while, which is avoided by calling
put_page() instead of putback_lru_page() in page migration's success
path. This means that this patch reverts a major change from commit
add05cecef80 about the new refcounting rule of soft-offlined pages, so
"reuse window" revives. This will be closed by a subsequent patch.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8809aa2d 24-Jun-2015 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm: clarify that the function operates on hugepage pte

We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.

We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# add05cec 24-Jun-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: don't free target page in successful page migration

Stress testing showed that soft offline events for a process iterating
"mmap-pagefault-munmap" loop can trigger
VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

Soft offlining page 0x70fe1 at 0x70100008d000
Soft offlining page 0x705fb at 0x70300008d000
page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
flags: 0x1fffff80800000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
RIP: free_pcppages_bulk+0x52a/0x6f0
Call Trace:
drain_pages_zone+0x3d/0x50
drain_local_pages+0x1d/0x30
on_each_cpu_mask+0x46/0x80
drain_all_pages+0x14b/0x1e0
soft_offline_page+0x432/0x6e0
SyS_madvise+0x73c/0x780
system_call_fastpath+0x12/0x17
Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
RIP [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
RSP <ffff88007a117d28>
---[ end trace 53926436e76d1f35 ]---

When soft offline successfully migrates page, the source page is supposed
to be freed. But there is a race condition where a source page looks
isolated (i.e. the refcount is 0 and the PageHWPoison is set) but
somewhat linked to pcplist. Then another soft offline event calls
drain_all_pages() and tries to free such hwpoisoned page, which is
forbidden.

This odd page state seems to happen due to the race between put_page() in
putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
drop lru_add_drain_all() in __soft_offline_page()", or to change page
freeing code for this soft offline's purpose.

Instead, let's think about the difference between hard offline and soft
offline. There is an interesting difference in how to isolate the in-use
page between these, that is, hard offline marks PageHWPoison of the target
page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
offline tries to free the target page then marks PageHWPoison. This
difference might be the source of complexity and result in bugs like the
above. So making soft offline isolate with keeping refcount can be a
solution for this problem.

We can pass to page migration code the "reason" which shows the caller, so
let's use this more to avoid calling putback_lru_page() when called from
soft offline, which effectively does the isolation for soft offline. With
this change, target pages of soft offline never be reused without changing
migratetype, so this patch also removes the related code.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b3b3a99c 15-Apr-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/migrate: check-before-clear PageSwapCache

With the page flag sanitization patchset, an invalid usage of
ClearPageSwapCache() is detected in migration_page_copy().
migrate_page_copy() is shared by both normal and hugepage (both thp and
hugetlb) code path, so let's check PageSwapCache() and clear it if it's
set to avoid misuse of the invalid clear operation.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a8e7002 14-Apr-2015 Mel Gorman <mgorman@suse.de>

mm: numa: remove migrate_ratelimited

This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
scanning even if migrate rate limited") so remove it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef2a5153 14-Apr-2015 Geert Uytterhoeven <geert+renesas@glider.be>

mm/migrate: mark unmap_and_move() "noinline" to avoid ICE in gcc 4.7.3

With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) :

mm/migrate.c: In function `migrate_pages':
mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.7/README.Bugs> for instructions.
Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
make[1]: *** [mm/migrate.o] Error 1
make: *** [mm/migrate.o] Error 2

Mark unmap_and_move() (which is used in a single place only) "noinline"
to work around this compiler bug.

[akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
[khilman@kernel.org: fine-tune compiler versions]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: Kevin Hilman <khilman@kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d942466 12-Feb-2015 Mel Gorman <mgorman@suse.de>

mm: convert p[te|md]_mknonnuma and remaining page table manipulations

With PROT_NONE, the traditional page table manipulation functions are
sufficient.

[andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
[akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d833062 12-Feb-2015 Mel Gorman <mgorman@suse.de>

mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults. It was
universally loved by everybody and caused no problems whatsoever. That
last sentence might be a lie.

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them
relatively minor contributions. At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that. I tested trinity under kvm-tool and passed and ran a few
other basic tests. At the time of writing, only the short-lived tests
have completed but testing of V2 indicated that long-term testing had no
surprises. In most cases I'm leaving out detail as it's not that
interesting.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher. Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
3.18.0-rc5 3.18.0-rc5
mmotm-20141119 protnone-v3r3
User NUMA01 32380.24 ( 0.00%) 21642.92 ( 33.16%)
User NUMA01_THEADLOCAL 22481.02 ( 0.00%) 22283.22 ( 0.88%)
User NUMA02 3137.00 ( 0.00%) 3116.54 ( 0.65%)
User NUMA02_SMT 1614.03 ( 0.00%) 1543.53 ( 4.37%)
System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%)
System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%)
System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%)
System NUMA02_SMT 7.36 ( 0.00%) 7.45 ( -1.22%)
Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%)
Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%)
Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%)
CPU NUMA01 4563.00 ( 0.00%) 3855.00 ( 15.52%)
CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%)
CPU NUMA02 3785.00 ( 0.00%) 3770.00 ( 0.40%)
CPU NUMA02_SMT 1872.00 ( 0.00%) 1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage is
great. Overall time to complete the benchmark is comparable

3.18.0-rc5 3.18.0-rc5
mmotm-20141119protnone-v3r3
User 59612.50 48586.44
System 460.22 1537.45
Elapsed 1442.20 1304.29

NUMA alloc hit 5075182 5743353
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 5075174 5743339
NUMA base PTE updates 637061448 443106883
NUMA huge PMD updates 1243434 864747
NUMA page range updates 1273699656 885857347
NUMA hint faults 1658116 1214277
NUMA hint local faults 959487 754113
NUMA hint local percent 57 62
NUMA pages migrated 5467056 61676398

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism. For all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts. To some extent, higher overhead from this
approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.

This patch (of 10):

A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is deferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e66f17ff 11-Feb-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/hugetlb: take page table lock in follow_huge_pmd()

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.

This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.

This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.

We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.

Here is the reproducer:

$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000

int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;

pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);

while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");

for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}

$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000

int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;

while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}

$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27ba0644 10-Feb-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

rmap: drop support of non-linear mappings

We don't create non-linear mappings anymore. Let's drop code which
handles them in rmap.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 50062175 15-May-2014 Al Viro <viro@zeniv.linux.org.uk>

vm_area_operations: kill ->migrate()

the only instance this method has ever grown was one in kernfs -
one that call ->migrate() of another vm_ops if it exists.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2ebba6b7 12-Dec-2014 Hugh Dickins <hughd@google.com>

mm: unmapped page migration avoid unmap+remap overhead

Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
created for use on pages almost certainly mapped into userspace. But
nowadays compaction often applies them to unmapped page cache pages: which
may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
try_to_unmap_file() makes no preliminary page_mapped() check.

Now check page_mapped() in __unmap_and_move(); and avoid repeating the
same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
never inserted any.

(The PageAnon(page) comment blocks now look even sillier than before, but
clean that up on some other occasion. And note in passing that
try_to_unmap_one() does not use a migration entry when PageSwapCache, so
remove_migration_ptes() will then not update that swap entry to newpage
pte: not a big deal, but something else to clean up later.)

Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
conversion to i_mmap_rwsem, that "The biggest winner of these changes is
migration": a part of the reason might be all of that unnecessary taking
of i_mmap_mutex in page migration; and it's rather a shame that I didn't
get around to sending this patch in before his - this one is much less
useful after Davidlohr's conversion to rwsem, but still good.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d6d86c0a 09-Oct-2014 Konstantin Khlebnikov <koct9i@gmail.com>

mm/balloon_compaction: redesign ballooned pages management

Sasha Levin reported KASAN splash inside isolate_migratepages_range().
Problem is in the function __is_movable_balloon_page() which tests
AS_BALLOON_MAP in page->mapping->flags. This function has no protection
against anonymous pages. As result it tried to check address space flags
inside struct anon_vma.

Further investigation shows more problems in current implementation:

* Special branch in __unmap_and_move() never works:
balloon_page_movable() checks page flags and page_count. In
__unmap_and_move() page is locked, reference counter is elevated, thus
balloon_page_movable() always fails. As a result execution goes to the
normal migration path. virtballoon_migratepage() returns
MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
move_to_new_page() thinks this is an error code and assigns
newpage->mapping to NULL. Newly migrated page lose connectivity with
balloon an all ability for further migration.

* lru_lock erroneously required in isolate_migratepages_range() for
isolation ballooned page. This function releases lru_lock periodically,
this makes migration mostly impossible for some pages.

* balloon_page_dequeue have a tight race with balloon_page_isolate:
balloon_page_isolate could be executed in parallel with dequeue between
picking page from list and locking page_lock. Race is rare because they
use trylock_page() for locking.

This patch fixes all of them.

Instead of fake mapping with special flag this patch uses special state of
page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. Buddy allocator uses
PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose. Storing mark
directly in struct page makes everything safer and easier.

PagePrivate is used to mark pages present in page list (i.e. not
isolated, like PageLRU for normal pages). It replaces special rules for
reference counter and makes balloon migration similar to migration of
normal pages. This flag is protected by page_lock together with link to
the balloon device.

Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: <stable@vger.kernel.org> [3.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3cb8bf6 02-Oct-2014 Mel Gorman <mgorman@suse.de>

mm: migrate: Close race between migration completion and mprotect

A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.

[torvalds@linux-foundation.org: use maybe_mkwrite]
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a31bc97 08-Aug-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: memcontrol: rewrite uncharge API

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages. However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
possibly repeatedly before the page is actually freed. This means
that the memcg swapout code is called from many contexts that make
no sense and it has to figure out the direction from page state to
make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
so memcg code has to prevent untimely uncharging in that case.
Because this code - which should be a simple charge transfer - is so
special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped. Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge. The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache(). However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration. Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration. Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bdd6380 26-Jul-2014 Hugh Dickins <hughd@google.com>

mm: fix direct reclaim writeback regression

Shortly before 3.16-rc1, Dave Jones reported:

WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
xfs_vm_writepage+0x5ce/0x630 [xfs]()
CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
etc.

970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
971 PF_MEMALLOC))

I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.

However, my own /var/log/messages now shows similar complaints

WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()

from stressing some mmotm trees during July.

Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?

Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
page freeing callback") has provided such a way to compaction: if
migrating a SwapBacked page fails, its newpage may be put back on the
list for later use with PageSwapBacked still set, and nothing will clear
it.

Whether that can do anything worse than issue WARN_ON_ONCEs, and get
some statistics wrong, is unclear: easier to fix than to think through
the consequences.

Fixing it here, before the put_new_page(), addresses the bug directly,
but is probably the worst place to fix it. Page migration is doing too
many parts of the job on too many levels: fixing it in
move_to_new_page() to complement its SetPageSwapBacked would be
preferable, except why is it (and newpage->mapping and newpage->index)
done there, rather than down in migrate_page_move_mapping(), once we are
sure of success? Not a cleanup to get into right now, especially not
with memcg cleanups coming in 3.17.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f72e7dcd 23-Jun-2014 Hugh Dickins <hughd@google.com>

mm: let mm_find_pmd fix buggy race with THP fault

Trinity has reported:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
lock_acquire (arch/x86/include/asm/current.h:14
kernel/locking/lockdep.c:3602)
_raw_spin_lock (include/linux/spinlock_api_smp.h:143
kernel/locking/spinlock.c:151)
remove_migration_pte (mm/migrate.c:137)
rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
remove_migration_ptes (mm/migrate.c:224)
migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
migrate_misplaced_page (mm/migrate.c:1733)
__handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
handle_mm_fault (mm/memory.c:3948)
__get_user_pages (mm/memory.c:1851)
__mlock_vma_pages_range (mm/mlock.c:255)
__mm_populate (mm/mlock.c:711)
SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

I believe this comes about because, whereas collapsing and splitting THP
functions take anon_vma lock in write mode (which excludes concurrent
rmap walks), faulting THP functions (write protection and misplaced
NUMA) do not - and mostly they do not need to.

But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
an instant (indeed, for a long instant, given the inter-CPU TLB flush in
there), leaves *pmd neither present not trans_huge.

Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace. Although that rmap walk has a 4k page
to insert, anon_vmas containing THPs are in no way segregated from
4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
that instant when a trans_huge pmd is temporarily absent.

I don't think we need strengthen the locking at the THP end: it's easily
handled with an ACCESS_ONCE() before testing both conditions.

And since mm_find_pmd() had only one caller who wanted a THP rather than
a pmd, let's slightly repurpose it to fail when it hits a THP or
non-present pmd, and open code split_huge_page_address() again.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 100873d7 04-Jun-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hugetlb: rename hugepage_migration_support() to ..._supported()

We already have a function named hugepages_supported(), and the similar
name hugepage_migration_support() is a bit unconfortable, so let's rename
it hugepage_migration_supported().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68711a74 04-Jun-2014 David Rientjes <rientjes@google.com>

mm, migration: add destination page freeing callback

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11de9927 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: numa: add migrated transhuge pages to LRU the same way as base pages

Migration of misplaced transhuge pages uses page_add_new_anon_rmap() when
putting the page back as it avoided an atomic operations and added the new
page to the correct LRU. A side-effect is that the page gets marked
activated as part of the migration meaning that transhuge and base pages
are treated differently from an aging perspective than base page
migration.

This patch uses page_add_anon_rmap() and putback_lru_page() on completion
of a transhuge migration similar to base page migration. It would require
fewer atomic operations to use lru_cache_add without taking an additional
reference to the page. The downside would be that it's still different to
base page migration and unevictable pages may be added to the wrong LRU
for cleaning up later. Testing of the usual workloads did not show any
adverse impact to the change.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e09e738 20-Mar-2014 Hugh Dickins <hughd@google.com>

mm: fix swapops.h:131 bug if remap_file_pages raced migration

Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
indicating that remove_migration_ptes() failed to find one of the
migration entries that was temporarily inserted.

The problem comes from remap_file_pages()'s switch from vma_interval_tree
(good for inserting the migration entry) to i_mmap_nonlinear list (no good
for locating it again); but can only be a problem if the remap_file_pages()
range does not cover the whole of the vma (zap_pte() clears the range).

remove_migration_ptes() needs a file_nonlinear method to go down the
i_mmap_nonlinear list, applying linear location to look for migration
entries in those vmas too, just in case there was this race.

The file_nonlinear method does need rmap_walk_control.arg to do this;
but it never needed vma passed in - vma comes from its own iteration.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e97ca8e5 10-Mar-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: fix GFP_THISNODE callers and clarify

GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes. It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g. through a subsequent allocation request
without GFP_THISNODE set.

However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary. This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.

Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE. This restricts the allocation a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3978a51 27-Jan-2014 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/migrate.c: fix setting of cpupid on page migration twice against normal page

Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
over the cpupid at page migration time. It is unnecessary to set it
again in alloc_misplaced_dst_page().

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# baae911b 23-Jan-2014 Wanpeng Li <liwanp@linux.vnet.ibm.com>

sched/numa: fix setting of cpupid on page migration twice

Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copiess
over the cpupid at page migration time. It is unnecessary to set it
again in migrate_misplaced_transhuge_page().

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 309381fea 23-Jan-2014 Sasha Levin <sasha.levin@oracle.com>

mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE

Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.

I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.

[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78d5506e 21-Jan-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: remove unused function, fail_migrate_page()

fail_migrate_page() isn't used anywhere, so remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59c82b70 21-Jan-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages

Some part of putback_lru_pages() and putback_movable_pages() is
duplicated, so it could confuse us what we should use. We can remove
putback_lru_pages() since it is not really needed now. This makes us
undestand and maintain the code more easily.

And comment on putback_movable_pages() is stale now, so fix it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32665f2b 21-Jan-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: correct failure handling if !hugepage_migration_support()

We should remove the page from the list if we fail with ENOSYS, since
migrate_pages() consider error cases except -ENOMEM and -EAGAIN as
permanent failure and it assumes that the page would be removed from the
list. Without this patch, we could overcount number of failure.

In addition, we should put back the new hugepage if
!hugepage_migration_support(). If not, we would leak hugepage memory.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 354a3363 21-Jan-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/migrate: add comment about permanent failure path

Let's add a comment about where the failed page goes to, which makes
code more readable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af1839d7 21-Jan-2014 Mel Gorman <mgorman@suse.de>

mm: numa: trace tasks that fail migration due to rate limiting

A low local/remote numa hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when
migration fails due to migration rate limitation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c5e9c27 21-Jan-2014 Mel Gorman <mgorman@suse.de>

mm: numa: limit scope of lock for NUMA migrate rate limiting

NUMA migrate rate limiting protects a migration counter and window using
a lock but in some cases this can be a contended lock. It is not
critical that the number of pages be perfect, lost updates are
acceptable. Reduce the importance of this lock.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c30e017 21-Jan-2014 Mel Gorman <mgorman@suse.de>

mm: numa: make NUMA-migrate related functions static

numamigrate_update_ratelimit and numamigrate_isolate_page only have
callers in mm/migrate.c. This patch makes them static.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 051ac83a 21-Jan-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/rmap: make rmap_walk to get the rmap_walk_control argument

In each rmap traverse case, there is some difference so that we need
function pointers and arguments to them in order to handle these

For this purpose, struct rmap_walk_control is introduced in this patch,
and will be extended in following patch. Introducing and extending are
separate, because it clarify changes.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34ee645e 12-Nov-2014 Joerg Roedel <jroedel@suse.de>

mmu_notifier: call mmu_notifier_invalidate_range() from VMM

Add calls to the new mmu_notifier_invalidate_range() function to all
places in the VMM that need it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Jay Cornwall <Jay.Cornwall@amd.com>
Cc: Oded Gabbay <Oded.Gabbay@amd.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Oded Gabbay <oded.gabbay@amd.com>


# 8e321fef 21-Dec-2013 Benjamin LaHaise <bcrl@kvack.org>

aio/migratepages: make aio migrate pages sane

The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# b0943d61 18-Dec-2013 Mel Gorman <mgorman@suse.de>

mm: numa: defer TLB flush for THP migration as long as possible

THP migration can fail for a variety of reasons. Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de466bd6 18-Dec-2013 Mel Gorman <mgorman@suse.de>

mm: numa: avoid unnecessary disruption of NUMA hinting during migration

do_huge_pmd_numa_page() handles the case where there is parallel THP
migration. However, by the time it is checked the NUMA hinting
information has already been disrupted. This patch adds an earlier
check with some helpers.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb4489f6 18-Dec-2013 Mel Gorman <mgorman@suse.de>

mm: numa: avoid unnecessary work on the failure path

If a PMD changes during a THP migration then migration aborts but the
failure path is doing more work than is necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f714f4f2 18-Dec-2013 Mel Gorman <mgorman@suse.de>

mm: numa: call MMU notifiers on THP migration

MMU notifiers must be called on THP page migration or secondary MMUs
will get very confused.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b4847e7 18-Dec-2013 Mel Gorman <mgorman@suse.de>

mm: numa: serialise parallel get_user_page against THP migration

Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.

THP does not unmap pages due to a lack of support for migration entries
at a PMD level. This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side the page table lock is taken for each PTE update.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30b0a105 21-Nov-2013 Dave Hansen <dave.hansen@linux.intel.com>

mm: thp: give transparent hugepage code a separate copy_page

Right now, the migration code in migrate_page_copy() uses copy_huge_page()
for hugetlbfs and thp pages:

if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);

So, yay for code reuse. But:

void copy_huge_page(struct page *dst, struct page *src)
{
struct hstate *h = page_hstate(src);

and a non-hugetlbfs page has no page_hstate(). This works 99% of the
time because page_hstate() determines the hstate from the page order
alone. Since the page order of a THP page matches the default hugetlbfs
page order, it works.

But, if you change the default huge page size on the boot command-line
(say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
so page_hstate() returns null and copy_huge_page() oopses pretty fast
since copy_huge_page() dereferences the hstate:

void copy_huge_page(struct page *dst, struct page *src)
{
struct hstate *h = page_hstate(src);
if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
...

Mel noticed that the migration code is really the only user of these
functions. This moves all the copy code over to migrate.c and makes
copy_huge_page() work for THP by checking for it explicitly.

I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
THP migration for the NUMA working set scanning fault case")

[akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c4088ebd 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: convert the rest to new page table lock api

Only trivial cases left. Let's convert them altogether.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb900f41 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, hugetlb: convert hugetlbfs to use split pmd lock

Hugetlb supports multiple page sizes. We use split lock only for PMD
level, but not for PUD.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f926ab9 07-Oct-2013 Mel Gorman <mgorman@suse.de>

mm: Close races between THP migration and PMD numa clearing

THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open

Task A Task B
--------------------- ---------------------
do_huge_pmd_numa_page do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
lock_page
mpol_misplaced == 2
migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at

During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c3d16e16 16-Oct-2013 Cyrill Gorcunov <gorcunov@gmail.com>

mm: migration: do not lose soft dirty bit if page is in migration state

If page migration is turned on in config and the page is migrating, we
may lose the soft dirty bit. If fork and mprotect are called on
migrating pages (once migration is complete) pages do not obtain the
soft dirty bit in the correspond pte entries. Fix it adding an
appropriate test on swap entries.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7851a45c 07-Oct-2013 Rik van Riel <riel@redhat.com>

mm: numa: Copy cpupid on page migration

After page migration, the new page has the nidpid unset. This makes
every fault on a recently migrated page look like a first numa fault,
leading to another page migration.

Copying over the nidpid at page migration time should prevent erroneous
migrations of recently migrated pages.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-46-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 90572890 07-Oct-2013 Peter Zijlstra <peterz@infradead.org>

mm: numa: Change page last {nid,pid} into {cpu,pid}

Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and lookup the alternate task more
easily. Note that even though it is the cpu that is store in the page
flags that the mpol_misplaced decision is still based on the node.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
[ Fixed build failure on 32-bit systems. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b795854b 07-Oct-2013 Mel Gorman <mgorman@suse.de>

sched/numa: Set preferred NUMA node based on number of private faults

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix complication error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1bc115d8 07-Oct-2013 Mel Gorman <mgorman@suse.de>

mm: numa: Scan pages with elevated page_mapcount

Currently automatic NUMA balancing is unable to distinguish between false
shared versus private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose task is using them but that is ignored quite a lot of data.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages is now in place. The ordering is so
that the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed within vmas marked
VM_EXEC as these are generally shared library pages. Migrating such pages
is not beneficial as there is an expectation they are read-shared between
caches and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a54a407f 07-Oct-2013 Mel Gorman <mgorman@suse.de>

mm: Close races between THP migration and PMD numa clearing

THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open

Task A Task B
--------------------- ---------------------
do_huge_pmd_numa_page do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
lock_page
mpol_misplaced == 2
migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at

During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 117aad1e 30-Sep-2013 Rafael Aquini <aquini@redhat.com>

mm: avoid reinserting isolated balloon pages into LRU lists

Isolated balloon pages can wrongly end up in LRU lists when
migrate_pages() finishes its round without draining all the isolated
page list.

The same issue can happen when reclaim_clean_pages_from_list() tries to
reclaim pages from an isolated page list, before migration, in the CMA
path. Such balloon page leak opens a race window against LRU lists
shrinkers that leads us to the following kernel panic:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
PGD 3cda2067 PUD 3d713067 PMD 0
Oops: 0000 [#1] SMP
CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: shrink_page_list+0x24e/0x897
RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
Call Trace:
shrink_inactive_list+0x240/0x3de
shrink_lruvec+0x3e0/0x566
__shrink_zone+0x94/0x178
shrink_zone+0x3a/0x82
balance_pgdat+0x32a/0x4c2
kswapd+0x2f0/0x372
kthread+0xa2/0xaa
ret_from_fork+0x7c/0xb0
Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
RIP [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
RSP <ffff88003da499b8>
CR2: 0000000000000028
---[ end trace 703d2451af6ffbfd ]---
Kernel panic - not syncing: Fatal exception

This patch fixes the issue, by assuring the proper tests are made at
putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
isolated balloon pages being wrongly reinserted in LRU lists.

[akpm@linux-foundation.org: clarify awkward comment text]
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e543d57 11-Sep-2013 Lisa Du <cldu@marvell.com>

mm: vmscan: fix do_try_to_free_pages() livelock

This patch is based on KOSAKI's work and I add a little more description,
please refer https://lkml.org/lkml/2012/6/14/74.

Currently, I found system can enter a state that there are lots of free
pages in a zone but only order-0 and order-1 pages which means the zone is
heavily fragmented, then high order allocation could make direct reclaim
path's long stall(ex, 60 seconds) especially in no swap and no compaciton
enviroment. This problem happened on v3.4, but it seems issue still lives
in current tree, the reason is do_try_to_free_pages enter live lock:

kswapd will go to sleep if the zones have been fully scanned and are still
not balanced. As kswapd thinks there's little point trying all over again
to avoid infinite loop. Instead it changes order from high-order to
0-order because kswapd think order-0 is the most important. Look at
73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep
and may leave zone->all_unreclaimable =3D 0. It assume high-order users
can still perform direct reclaim if they wish.

Direct reclaim continue to reclaim for a high order which is not a
COSTLY_ORDER without oom-killer until kswapd turn on
zone->all_unreclaimble= . This is because to avoid too early oom-kill.
So it means direct_reclaim depends on kswapd to break this loop.

In worst case, direct-reclaim may continue to page reclaim forever when
kswapd sleeps forever until someone like watchdog detect and finally kill
the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737

We can't turn on zone->all_unreclaimable from direct reclaim path because
direct reclaim path don't take any lock and this way is racy. Thus this
patch removes zone->all_unreclaimable field completely and recalculates
zone reclaimable state every time.

Note: we can't take the idea that direct-reclaim see zone->pages_scanned
directly and kswapd continue to use zone->all_unreclaimable. Because, it
is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the detail.

[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83467efb 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: check movability of hugepage in unmap_and_move_huge_page()

Currently hugepage migration works well only for pmd-based hugepages
(mainly due to lack of testing,) so we had better not enable migration of
other levels of hugepages until we are ready for it.

Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
page table walk and check pud/pmd_huge() there, so they are safe. But the
other users (softoffline and memory hotremove) don't do this, so without
this patch they can try to migrate unexpected types of hugepages.

To prevent this, we introduce hugepage_migration_support() as an
architecture dependent check of whether hugepage are implemented on a pmd
basis or not. And on some architecture multiple sizes of hugepages are
available, so hugepage_migration_support() also checks hugepage size.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e632a938 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: add hugepage migration code to move_pages()

Extend move_pages() to handle vma with VM_HUGETLB set. We will be able to
migrate hugepage with move_pages(2) after applying the enablement patch
which comes later in this series.

We avoid getting refcount on tail pages of hugepage, because unlike thp,
hugepage is not split and we need not care about races with splitting.

And migration of larger (1GB for x86_64) hugepage are not enabled.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8ec1cee 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: use migrate_pages() instead of migrate_huge_page()

Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
as an argument, instead of taking a pointer to the list of hugepages to be
migrated. This behavior was introduced in commit 189ebff28 ("hugetlb:
simplify migrate_huge_page()"), and was OK because until now hugepage
migration is enabled only for soft-offlining which migrates only one
hugepage in a single call.

But the situation will change in the later patches in this series which
enable other users of page migration to support hugepage migration. They
can kick migration for both of normal pages and hugepages in a single
call, so we need to go back to original implementation which uses linked
lists to collect the hugepages to be migrated.

With this patch, soft_offline_huge_page() switches to use migrate_pages(),
and migrate_huge_page() is not used any more. So let's remove it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31caf665 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migrate: make core migration code aware of hugepage

Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.

The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.

This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)

Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)

As for how these are achived, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.

My test is mainly functional one, simply kicking hugepage migration via
each entry point and confirm that migration is done correctly. Test code
is available here:

git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.

This patch (of 9):

Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both of LRU pages and hugepages.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36bc08cc 16-Jul-2013 Gu Zheng <guz.fnst@cn.fujitsu.com>

fs/aio: Add support to aio ring pages migration

As the aio job will pin the ring pages, that will lead to mem migrated
failed. In order to fix this problem we use an anon inode to manage the aio ring
pages, and setup the migratepage callback in the anon inode's address space, so
that when mem migrating the aio ring pages will be moved to other mem node safely.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 30dad309 12-Jun-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: migration: add migrate_entry_wait_huge()

When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes. As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.

This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage. This patch introduces
migration_entry_wait_huge() to solve this.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [2.6.35+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2cc499c 24-May-2013 Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>

mm compaction: fix of improper cache flush in migration code

Page 'new' during MIGRATION can't be flushed with flush_cache_page().
Using flush_cache_page(vma, addr, pfn) is justified only if the page is
already placed in process page table, and that is done right after
flush_cache_page(). But without it the arch function has no knowledge
of process PTE and does nothing.

Besides that, flush_cache_page() flushes an application cache page, but
the kernel has a different page virtual address and dirtied it.

Replace it with flush_dcache_page(new) which is the proper usage.

The old page is flushed in try_to_unmap_one() before migration.

This bug takes place in Sead3 board with M14Kc MIPS CPU without cache
aliasing (but Harvard arch - separate I and D cache) in tight memory
environment (128MB) each 1-3days on SOAK test. It fails in cc1 during
kernel build (SIGILL, SIGBUS, SIGSEG) if CONFIG_COMPACTION is switched
ON.

Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Cc: Leonid Yegoshin <yegoshin@mips.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c73e5c9c 29-Apr-2013 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

mm: rewrite the comment over migrate_pages() more comprehensibly

The comment over migrate_pages() looks quite weird, and makes it hard to
grasp what it is trying to say. Rewrite it more comprehensibly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fed5b64a 29-Apr-2013 Jianguo Wu <wujianguo@huawei.com>

mm/migrate: fix comment typo syncronous->synchronous

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c620e2b 22-Feb-2013 Hugh Dickins <hughd@google.com>

mm: remove offlining arg to migrate_pages

No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b79bc0a0 22-Feb-2013 Hugh Dickins <hughd@google.com>

ksm: enable KSM page migration

Migration of KSM pages is now safe: remove the PageKsm restrictions from
mempolicy.c and migrate.c.

But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
are irrelevant to KSM: it looks as if that code was preventing hotremove
migration of KSM pages, unless they happened to be in swapcache.

There is some question as to whether enforcing a NUMA mempolicy migration
ought to migrate KSM pages, mapped into entirely unrelated processes; but
moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
any area where this is a worry.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8d6553b 22-Feb-2013 Hugh Dickins <hughd@google.com>

ksm: make KSM page migration possible

KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt to
deal with it either. But how will I test a solution, when I don't know
how to hotremove memory? The best answer is to enable KSM page migration
in all cases now, and test more common cases. With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.

Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22b751c3 22-Feb-2013 Mel Gorman <mgorman@suse.de>

mm: rename page struct field helpers

The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as
it is used throughout the arch code a rename would likely cause more
conflicts than it is worth.

[akpm@linux-foundation.org: fix zcache]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 340ef390 22-Feb-2013 Hugh Dickins <hughd@google.com>

mm: numa: cleanup flow of transhuge page migration

When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
THP before migrating") Hugh Dickins noted that the control flow for
transhuge migration was difficult to follow. Unconditionally calling
put_page() in numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex that they should be. Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the
put_page() should never be the final put_page.

Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page().

There is no functional change after this patch is applied but the code
is easier to follow and unlock_page() always happens before put_page().

[mgorman@suse.de: changelog only]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3abef4e6 22-Feb-2013 Mel Gorman <mgorman@suse.de>

mm: numa: take THP into account when migrating pages for NUMA balancing

Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking
THP.

The consequences are that a migration will be attempted when a target
node is nearly full and fail later. It's unlikely to be user-visible
but it should be fixed. While we are there, migrate_balanced_pgdat()
should treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be7517d6 04-Feb-2013 Tony Lu <zlu@tilera.com>

mm/hugetlb: set PTE as huge in hugetlb_change_protection and remove_migration_pte

When setting a huge PTE, besides calling pte_mkhuge(), we also need to
call arch_make_huge_pte(), which we indeed do in make_huge_pte(), but we
forget to do in hugetlb_change_protection() and remove_migration_pte().

Signed-off-by: Zhigang Lu <zlu@tilera.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04fa5d6a 11-Jan-2013 Mel Gorman <mgorman@suse.de>

mm: migrate: check page_count of THP before migrating

Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
not check page_count before migrating like base page migration and
khugepage. He could not see why this was safe and he is right.

The potential impact of the bug is avoided due to the limitations of
NUMA balancing. The page_mapcount() check ensures that only a single
address space is using this page and as THPs are typically private it
should not be possible for another address space to fault it in
parallel. If the address space has one associated task then it's
difficult to have both a GUP pin and be referencing the page at the same
time. If there are multiple tasks then a buggy scenario requires that
another thread be accessing the page while the direct IO is in flight.
This is dodgy behaviour as there is a possibility of corruption with or
without THP migration. It would be

While we happen to be safe for the most part it is shoddy to depend on
such "safety" so this patch checks the page count similar to anonymous
pages. Note that this does not mean that the page_mapcount() check can
go away. If we were to remove the page_mapcount() check the the THP
would have to be unmapped from all referencing PTEs, replaced with
migration PTEs and restored properly afterwards.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce4a9cc5 10-Dec-2012 Stephen Rothwell <sfr@canb.auug.org.au>

mm,numa: fix update_mmu_cache_pmd call

This build error is currently hidden by the fact that the x86
implementation of 'update_mmu_cache_pmd()' is a macro that doesn't use
its last argument, but commit b32967ff101a ("mm: numa: Add THP migration
for the NUMA working set scanning fault case") introduced a call with
the wrong third argument.

In the akpm tree, it causes this build error:

mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1666:2: error: incompatible type for argument 3 of 'update_mmu_cache_pmd'
arch/x86/include/asm/pgtable.h:792:20: note: expected 'struct pmd_t *' but argument is of type 'pmd_t'

Fix it.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 389162c2 12-Dec-2012 Lai Jiangshan <laijs@cn.fujitsu.com>

mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.

The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5733c7d1 11-Dec-2012 Rafael Aquini <aquini@redhat.com>

mm: introduce putback_movable_pages()

The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on balloon page list as if a ballooned page was like a LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf6bddf1 11-Dec-2012 Rafael Aquini <aquini@redhat.com>

mm: introduce compaction and migration for ballooned pages

Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78bd5209 11-Dec-2012 Rafael Aquini <aquini@redhat.com>

mm: adjust address_space_operations.migratepage() return code

Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make
those subsystems aware of ballooned pages and allow memory balloon pages
become movable within a guest, thus avoiding the aforementioned
fragmentation issue

Following are numbers that prove this patch benefits on allowing
compaction to be more effective at memory ballooned guests.

Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
chunks, at every minute (inflating/deflating), while test was running:

===BEGIN stress-highalloc

STRESS-HIGHALLOC
highalloc-3.7 highalloc-3.7
rc4-clean rc4-patch
Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%)
Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%)
while Rested 75.00 ( 0.00%) 80.00 ( 5.00%)

MMTests Statistics: duration
3.7 3.7
rc4-clean rc4-patch
User 1207.59 1207.46
System 1300.55 1299.61
Elapsed 2273.72 2157.06

MMTests Statistics: vmstat
3.7 3.7
rc4-clean rc4-patch
Page Ins 3581516 2374368
Page Outs 11148692 10410332
Swap Ins 80 47
Swap Outs 3641 476
Direct pages scanned 37978 33826
Kswapd pages scanned 1828245 1342869
Kswapd pages reclaimed 1710236 1304099
Direct pages reclaimed 32207 31005
Kswapd efficiency 93% 97%
Kswapd velocity 804.077 622.546
Direct efficiency 84% 91%
Direct velocity 16.703 15.682
Percentage direct scans 2% 2%
Page writes by reclaim 79252 9704
Page writes file 75611 9228
Page writes anon 3641 476
Page reclaim immediate 16764 11014
Page rescued immediate 0 0
Slabs scanned 2171904 2152448
Direct inode steals 385 2261
Kswapd inode steals 659137 609670
Kswapd skipped wait 1 69
THP fault alloc 546 631
THP collapse alloc 361 339
THP splits 259 263
THP fault fallback 98 50
THP collapse fail 20 17
Compaction stalls 747 499
Compaction success 244 145
Compaction failures 503 354
Compaction pages moved 370888 474837
Compaction move failure 77378 65259

===END stress-highalloc

This patch:

Introduce MIGRATEPAGE_SUCCESS as the default return code for
address_space_operations.migratepage() method and documents the expected
return code for the same method in failure cases.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6219049a 11-Dec-2012 Bob Liu <lliubbo@gmail.com>

mm: introduce mm_find_pmd()

Several place need to find the pmd by(mm_struct, address), so introduce a
function to simplify it.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4fc3f1d6 02-Dec-2012 Ingo Molnar <mingo@kernel.org>

mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable

rmap_walk_anon() and try_to_unmap_anon() appears to be too
careful about locking the anon vma: while it needs protection
against anon vma list modifications, it does not need exclusive
access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:

96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# d28d4335 29-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: migrate: Account a transhuge page properly when rate limiting

If there is excessive migration due to NUMA balancing it gets rate
limited. It does this by counting the number of pages it has migrated
recently but counts a transhuge page as 1 page. Account for it properly.

Signed-off-by: Mel Gorman <mgorman@suse.de>


# 7548341b 27-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Account for failed allocations and isolations as migration failures

Subject says it all. Allocation failures and a failure to isolate should
be accounted as a migration failure. This is partially another
difference between base page and transhuge page migration. A base page
migration makes multiple attempts for these conditions before it would
be accounted for as a failure.

Signed-off-by: Mel Gorman <mgorman@suse.de>


# 220018d3 05-Dec-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Add THP migration for the NUMA working set scanning fault case build fix

Commit "Add THP migration for the NUMA working set scanning fault case"
breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to
explode without CONFIG_TRANSPARENT_HUGEPAGE:

mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed

CONFIG_NUMA_BALANCING allows compilation without enabling transparent
hugepages, so define the dummy function for such a configuration and only
define migrate_misplaced_transhuge_page_put() when transparent hugepages
are enabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# b32967ff 18-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Add THP migration for the NUMA working set scanning fault case.

Note: This is very heavily based on a patch from Peter Zijlstra with
fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
put a lot of migration logic into mm/huge_memory.c where it does
not belong. This version puts tries to share some of the migration
logic with migrate_misplaced_page. However, it should be noted
that now migrate.c is doing more with the pagetable manipulation
than is preferred. The end result is barely recognisable so as
before, the signed-offs had to be removed but will be re-added if
the original authors are ok with it.

Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.

[dhillf@gmail.com: Fix memory leak on isolation failure]
[dhillf@gmail.com: Fix transfer of last_nid information]
Signed-off-by: Mel Gorman <mgorman@suse.de>


# bac0382c 27-Nov-2012 Hillf Danton <dhillf@gmail.com>

mm: numa: migrate: Set last_nid on newly allocated page

Pass last_nid from misplaced page to newly allocated migration target page.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# e14808b4 19-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Rate limit setting of pte_numa if node is saturated

If there are a large number of NUMA hinting faults and all of them
are resulting in migrations it may indicate that memory is just
bouncing uselessly around. NUMA balancing cost is likely exceeding
any benefit from locality. Rate limit the PTE updates if the node
is migration rate-limited. As noted in the comments, this distorts
the NUMA faulting statistics.

Signed-off-by: Mel Gorman <mgorman@suse.de>


# a8f60772 14-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Rate limit the amount of memory that is migrated between nodes

NOTE: This is very heavily based on similar logic in autonuma. It should
be signed off by Andrea but because there was no standalone
patch and it's sufficiently different from what he did that
the signed-off is omitted. Will be added back if requested.

If a large number of pages are misplaced then the memory bus can be
saturated just migrating pages between nodes. This patch rate-limits
the amount of memory that can be migrating between nodes.

Signed-off-by: Mel Gorman <mgorman@suse.de>


# 03c5a6e1 02-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Add pte updates, hinting and migration stats

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u = basic unit = sizeof(void *)
Ca = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
where Cpte is incurred twice for a read and a write and Wlock
is a constant representing the cost of taking or releasing a
lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
where Wi is a constant that should reflect the approximate cost
of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
where Wnuma is the approximate NUMA factor. 1 is local. 1.2
would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
Cnumahint * numa_hint_faults +
Ci * numa_pages_migrated +
Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

benefit = (remote_accesses_before - remove_access_after) * Wnuma

but the information is not readily available. As a workload converges, the
expection would be that the number of remote numa hints would reduce to 0.

convergence = numa_hint_faults_local / numa_hint_faults
where this is measured for the last N number of
numa hints recorded. When the workload is fully
converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>


# 149c33e1 27-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: migrate: Drop the misplaced pages reference count if the target node is full

If we have to avoid migrating to a node that is nearly full, put page
and return zero.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# 7039e1db 25-Oct-2012 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: migrate: Introduce migrate_misplaced_page()

Note: This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection". The end result is barely recognisable so signed-offs
had to be dropped. If original authors are ok with it, I'll
re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>


# 7b2a2d4a 19-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: migrate: Add a tracepoint for migrate_pages

The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>


# 5647bc29 19-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: Move migration fail/success stats to migrate.c

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but it's at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>


# 0030f535 31-Jul-2012 Johannes Weiner <hannes@cmpxchg.org>

mm: memcg: fix compaction/migration failing due to memcg limits

Compaction (and page migration in general) can currently be hindered
through pages being owned by memory cgroups that are at their limits and
unreclaimable.

The reason is that the replacement page is being charged against the limit
while the page being replaced is also still charged. But this seems
unnecessary, given that only one of the two pages will still be in use
after migration finishes.

This patch changes the memcg migration sequence so that the replacement
page is not charged. Whatever page is still in use after successful or
failed migration gets to keep the charge of the page that was going to be
replaced.

The replacement page will still show up temporarily in the rss/cache
statistics, this can be fixed in a later patch as it's less urgent.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e6ac7fa 31-Jul-2012 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

With HugeTLB pages, hugetlb cgroup is uncharged in compound page
destructor. Since we are holding a hugepage reference, we can be sure
that old page won't get uncharged till the last put_page().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 189ebff2 31-Jul-2012 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

hugetlb: simplify migrate_huge_page()

Since we migrate only one hugepage, don't use linked list for passing the
page around. Directly pass the page that need to be migrated as argument.
This also removes the usage of page->lru in the migrate path.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 752dc185 02-Jun-2012 Hugh Dickins <hughd@google.com>

mm: fix warning in __set_page_dirty_nobuffers

New tmpfs use of !PageUptodate pages for fallocate() is triggering the
WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
is called from migrate_page_copy() for compaction.

It is anomalous that migration should use __set_page_dirty_nobuffers()
on an address_space that does not participate in dirty and writeback
accounting; and this has also been observed to insert surprising dirty
tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

We should probably give migrate_page_copy() a better way to preserve the
tag and migrate accounting info, when mapping_cap_account_dirty(). But
that needs some more work: so in the interim, avoid the warning by using
a simple SetPageDirty on PageSwapBacked pages.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b38a86eb 12-Mar-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 6e8b09ea 25-Apr-2012 Sasha Levin <levinsasha928@gmail.com>

mm: fix NULL ptr dereference in move_pages

Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
added an odd construct where 'mm' is checked for being NULL, and if it is,
it would get dereferenced anyways by mput()ing it.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3268c63e 21-Mar-2012 Christoph Lameter <cl@linux.com>

mm: fix move/migrate_pages() race on task struct

Migration functions perform the rcu_read_unlock too early. As a result
the task pointed to may change from under us. This can result in an oops,
as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

The following patch extend the period of the rcu_read_lock until after the
permissions checks are done. We also take a refcount so that the task
reference is stable when calling security check functions and performing
cpuset node validation (which takes a mutex).

The refcount is dropped before actual page migration occurs so there is no
change to the refcounts held during page migration.

Also move the determination of the mm of the task struct to immediately
before the do_migrate*() calls so that it is clear that we switch from
handling the task during permission checks to the mm for the actual
migration. Since the determination is only done once and we then no
longer use the task_struct we can be sure that we operate on a specific
address space that will not change from under us.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7512102c 05-Mar-2012 Hugh Dickins <hughd@google.com>

memcg: fix GPF when cgroup removal races with last exit

When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).

Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.

A task, or a charge, or a page on lru: each secures a memcg against
removal. In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.

Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons. So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.

There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed. We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().

I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.

But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
change? just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.

And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
clear pc->mem_cgroup if necessary").

Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 35512eca 03-Feb-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: postpone migrated page mapping reset

Postpone resetting page->mapping until the final remove_migration_ptes().
Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
work.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6bc32b8 12-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: introduce sync-light migration for use by compaction

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b969c4ab 12-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;

if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;

This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e5f01c2 12-Jan-2012 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: clear pc->mem_cgroup if necessary.

This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
and reducing atomic ops/complexity in memcg LRU handling.

In some cases, pages are added to lru before charge to memcg and pages
are not classfied to memory cgroup at lru addtion. Now, the lru where
the page should be added is determined a bit in page_cgroup->flags and
pc->mem_cgroup. I'd like to remove the check of flag.

To handle the case pc->mem_cgroup may contain stale pointers if pages
are added to LRU before classification. This patch resets
pc->mem_cgroup to root_mem_cgroup before lru additions.

[akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
[hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
[akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
[hughd@google.com: stop oops in mem_cgroup_reset_owner()]
[hughd@google.com: fix page migration to reset_owner]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# faed836a 10-Jan-2012 Wang Sheng-Hui <shhuiw@gmail.com>

mm/migrate.c: remove the unused macro lru_to_page

lru_to_page is not used in mm/migrate.c.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 564c81db 10-Jan-2012 Wang Sheng-Hui <shhuiw@gmail.com>

mm/migrate.c: cleanup comment for migration_entry_wait()

migration_entry_wait() can also be called from hugetlb_fault() now.
Remove the incorrect comment.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 937a94c9 10-Jan-2012 Jacobo Giralt <jacobo.giralt@gmail.com>

mm: migrate: one less atomic operation

migrate_page_move_mapping() drops a reference from the old page after
unfreezing its counter. Both operations can be merged into a single
atomic operation by directly unfreezing to one less reference.

The same applies to migrate_huge_page_move_mapping().

Signed-off-by: Jacobo Giralt <jacobo.giralt@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09761333 08-Dec-2011 Hillf Danton <dhillf@gmail.com>

mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages

Avoid unlocking and unlocked page if we failed to lock it.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0dabec93 31-Oct-2011 Minchan Kim <minchan.kim@gmail.com>

mm: migration: clean up unmap_and_move()

unmap_and_move() is one a big messy function. Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b95f1b31 16-Oct-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

mm: Map most files to use export.h instead of module.h

The files changed within are only using the EXPORT_SYMBOL
macro variants. They are not using core modular infrastructure
and hence don't need module.h but only the export.h header.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 486cf46f 19-Oct-2011 Hugh Dickins <hughd@google.com>

mm: fix race between mremap and removing migration entry

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.

The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.

Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.

Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.

Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99a15e21 16-Jun-2011 Andrea Arcangeli <aarcange@redhat.com>

migrate: don't account swapcache as shmem

swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.

Hugh pointed out we must use PageSwapCache instead of comparing
mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 746b18d4 24-May-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: use refcounts for page_lock_anon_vma()

Convert page_lock_anon_vma() over to use refcounts. This is done to
prepare for the conversion of anon_vma from spinlock to mutex.

Sadly this inceases the cost of page_lock_anon_vma() from one to two
atomics, a follow up patch addresses this, lets keep that simple for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 56039efa 23-Mar-2011 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: fix ugly initialization of return value is in caller

Remove initialization of vaiable in caller of memory cgroup function.
Actually, it's return value of memcg function but it's initialized in
caller.

Some memory cgroup uses following style to bring the result of start
function to the end function for avoiding races.

mem_cgroup_start_A(&(*ptr))
/* Something very complicated can happen here. */
mem_cgroup_end_A(*ptr)

In some calls, *ptr should be initialized to NULL be caller. But it's
ugly. This patch fixes that *ptr is initialized by _start function.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11bc82d6 22-Mar-2011 Andrea Arcangeli <aarcange@redhat.com>

mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those,
tends to be overkill in terms of latencies, so we can reduce their latency
by disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: Clemens Ladisch <cladisch@googlemail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9e60109f 22-Mar-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: rename drop_anon_vma() to put_anon_vma()

The normal code pattern used in the kernel is: get/put.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef6a3c63 22-Mar-2011 Miklos Szeredi <mszeredi@suse.cz>

mm: add replace_page_cache_page() function

This function basically does:

remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);

Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.

If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.

This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a879bf58 25-Feb-2011 Greg Thelen <gthelen@google.com>

mm: grab rcu read lock in move_pages()

The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
prevent free_pid() from reclaiming the pid.

Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
with:

CONFIG_LOCKUP_DETECTOR=y
CONFIG_PREEMPT=y
CONFIG_LOCKDEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y

Previously, migrate_pages() went through a similar transformation
replacing usage of tasklist_lock with rcu read lock:

commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
Author: Zeng Zhaoming <zengzm.kernel@gmail.com>
Date: Thu Dec 2 14:31:13 2010 -0800

mm/mempolicy.c: add rcu read lock to protect pid structure

commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Thu Jan 13 15:46:14 2011 -0800

mempolicy: remove tasklist_lock from migrate_pages

Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Zeng Zhaoming <zengzm.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48db54ee 01-Feb-2011 Minchan Kim <minchan.kim@gmail.com>

mm/migration: fix page corruption during hugepage migration

If migrate_huge_page by memory-failure fails , it calls put_page in itself
to decrease page reference and caller of migrate_huge_page also calls
putback_lru_pages. It can do double free of page so it can make page
corruption on page holder.

In addtion, clean of pages on caller is consistent behavior with
migrate_pages by cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
counting").

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57fc4a5e 01-Feb-2011 Andrea Arcangeli <aarcange@redhat.com>

mm: when migrate_pages returns 0, all pages must have been released

In some cases migrate_pages could return zero while still leaving a few
pages in the pagelist (and some caller wouldn't notice it has to call
putback_lru_pages after commit cf608ac19c9 ("mm: compaction: fix
COMPACTPAGEFAILED counting")).

Add one missing putback_lru_pages not added by commit cf608ac19c95 ("mm:
compaction: fix COMPACTPAGEFAILED counting").

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28bd6578 25-Jan-2011 Minchan Kim <minchan.kim@gmail.com>

mm: migration: clarify migrate_pages() comment

Callers of migrate_pages should putback_lru_pages to return pages
isolated to LRU or free list. Now comment is rather confusing. It says
caller always have to call it.

It is more clear to point out that the caller has to call it if
migrate_pages's return value isn't zero.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 50de1dd9 13-Jan-2011 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

memcg: fix memory migration of shmem swapcache

In the current implementation mem_cgroup_end_migration() decides whether
the page migration has succeeded or not by checking "oldpage->mapping".

But if we are tring to migrate a shmem swapcache, the page->mapping of it
is NULL from the begining, so the check would be invalid. As a result,
mem_cgroup_end_migration() assumes the migration has succeeded even if
it's not, so "newpage" would be freed while it's not uncharged.

This patch fixes it by passing mem_cgroup_end_migration() the result of
the page migration.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd4a4663 13-Jan-2011 Hugh Dickins <hughd@google.com>

mm: fix hugepage migration

2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
but its anon_vma handling was still based around the 2.6.35 conventions.
Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
drop_anon_vma in the same way as we're now changing unmap_and_move().

I don't particularly like to propose this for stable when I've not seen
its problems in practice nor tested the solution: but it's clearly out of
synch at present.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ce82b69 13-Jan-2011 Hugh Dickins <hughd@google.com>

mm: fix migration hangs on anon_vma lock

Increased usage of page migration in mmotm reveals that the anon_vma
locking in unmap_and_move() has been deficient since 2.6.36 (or even
earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
("mm: fix hang on anon_vma->root->lock") missed the issue here: the
anon_vma to which we get a reference may already have been freed back to
its slab (it is in use when we check page_mapped, but that can change),
and so its anon_vma->root may be switched at any moment by reuse in
anon_vma_prepare.

Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
then we don't need any rcu read locking over here.

In removing the rcu_unlock label: since PageAnon is a bit in
page->mapping, it's impossible for a !page->mapping page to be anon; but
insert VM_BUG_ON in case the implementation ever changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29c1f677 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration

migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
anonymous pages, as introduced by git commit
989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
migraton"). The point of the RCU protection there is part of getting a
stable reference to anon_vma and is only held for anon pages as file pages
are locked which is sufficient protection against freeing.

However, while a file page's mapping is being migrated, the radix tree is
double checked to ensure it is the expected page. This uses
radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
triggering the following warning.

[ 173.674290] ===================================================
[ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 173.676016] ---------------------------------------------------
[ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[ 173.676016]
[ 173.676016] other info that might help us debug this:
[ 173.676016]
[ 173.676016]
[ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
[ 173.676016] 1 lock held by hugeadm/2899:
[ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
[ 173.676016]
[ 173.676016] stack backtrace:
[ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
[ 173.676016] Call Trace:
[ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
[ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
[ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
[ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
[ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
[ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
[ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
[ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

This patch introduces radix_tree_deref_slot_protected() which calls
rcu_dereference_protected(). Users of it must pass in the
mapping->tree_lock that is protecting this dereference. Holding the tree
lock protects against parallel updaters of the radix tree meaning that
rcu_dereference_protected is allowable.

[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org> [2.6.37.early]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 500d65d4 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: pmd_trans_huge migrate bugcheck

No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f0f2496 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: migration: cleanup migrate_pages API by matching types for offlining and sync

With the introduction of the boolean sync parameter, the API looks a
little inconsistent as offlining is still an int. Convert offlining to a
bool for the sake of being tidy.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77f1fe6b 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path

Migration synchronously waits for writeback if the initial passes fails.
Callers of memory compaction do not necessarily want this behaviour if the
caller is latency sensitive or expects that synchronous migration is not
going to have a significantly better success rate.

This patch adds a sync parameter to migrate_pages() allowing the caller to
indicate if wait_on_page_writeback() is allowed within migration or not.
For reclaim/compaction, try_to_compact_pages() is first called
asynchronously, direct reclaim runs and then try_to_compact_pages() is
called synchronously as there is a greater expectation that it'll succeed.

[akpm@linux-foundation.org: build/merge fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e7d3449 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim

Lumpy reclaim is disruptive. It reclaims a large number of pages and
ignores the age of the pages it reclaims. This can incur significant
stalls and potentially increase the number of major faults.

Compaction has reached the point where it is considered reasonably stable
(meaning it has passed a lot of testing) and is a potential candidate for
displacing lumpy reclaim. This patch introduces an alternative to lumpy
reclaim whe compaction is available called reclaim/compaction. The basic
operation is very simple - instead of selecting a contiguous range of
pages to reclaim, a number of order-0 pages are reclaimed and then
compaction is later by either kswapd (compact_zone_order()) or direct
compaction (__alloc_pages_direct_compact()).

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: use conventional task_struct naming]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d1836c3 21-Dec-2010 Michal Nazarewicz <m.nazarewicz@samsung.com>

mm/migrate.c: fix compilation error

GCC complained about update_mmu_cache() not being defined in migrate.c.
Including <asm/tlbflush.h> seems to solve the problem.

Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 70384dc6 26-Oct-2010 Gleb Natapov <gleb@redhat.com>

mm: fix error reporting in move_pages() syscall

The vma returned by find_vma does not necessarily include the target
address. If this happens the code tries to follow a page outside of any
vma and returns ENOENT instead of EFAULT.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf608ac1 26-Oct-2010 Minchan Kim <minchan.kim@gmail.com>

mm: compaction: fix COMPACTPAGEFAILED counting

Presently update_nr_listpages() doesn't have a role. That's because lists
passed is always empty just after calling migrate_pages. The
migrate_pages cleans up page list which have failed to migrate before
returning by aaa994b3.

[PATCH] page migration: handle freeing of pages in migrate_pages()

Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().

At that time, we thought we don't need any postprocessing of pages. But
the situation is changed. The compaction need to know the number of
failed to migrate for COMPACTPAGEFAILED stat

This patch makes new rule for caller of migrate_pages to call
putback_lru_pages. So caller need to clean up the lists so it has a
chance to postprocess the pages. [suggested by Christoph Lameter]

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b430bee 26-Oct-2010 Wu Fengguang <fengguang.wu@intel.com>

writeback: remove nonblocking/encountered_congestion references

This removes more dead code that was somehow missed by commit 0d99519efef
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.

The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.

The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.

We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.

Inspired by Christoph Hellwig. Thanks!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ef8fd7f 11-Oct-2010 Andi Kleen <ak@linux.intel.com>

Fix migration.c compilation on s390

31bit s390 doesn't have huge pages and failed with:

> mm/migrate.c: In function 'remove_migration_pte':
> mm/migrate.c:143:3: error: implicit declaration of function 'pte_mkhuge'
> mm/migrate.c:143:7: error: incompatible types when assigning to type 'pte_t' from type 'int'

Put that code into a ifdef.

Reported by Heiko Carstens

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 290408d4 07-Sep-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hugetlb: hugepage migration core

This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining which
is triggered by memory corrected errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
memory hotplug and memocy compaction.
They are not ready for hugepage support for now.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 76545066 09-Aug-2010 Rik van Riel <riel@redhat.com>

mm: extend KSM refcounts to the anon_vma root

KSM reference counts can cause an anon_vma to exist after the processe it
belongs to have already exited. Because the anon_vma lock now lives in
the root anon_vma, we need to ensure that the root anon_vma stays around
until after all the "child" anon_vmas have been freed.

The obvious way to do this is to have a "child" anon_vma take a reference
to the root in anon_vma_fork. When the anon_vma is freed at munmap or
process exit, we drop the refcount in anon_vma_unlink and possibly free
the root anon_vma.

The KSM anon_vma reference count function also needs to be modified to
deal with the possibility of freeing 2 levels of anon_vma. The easiest
way to do this is to break out the KSM magic and make it generic.

When compiling without CONFIG_KSM, this code is compiled out.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 012f1800 09-Aug-2010 Rik van Riel <riel@redhat.com>

mm: always lock the root (oldest) anon_vma

Always (and only) lock the root (oldest) anon_vma whenever we do something
in an anon_vma. The recently introduced anon_vma scalability is due to
the rmap code scanning only the VMAs that need to be scanned. Many common
operations still took the anon_vma lock on the root anon_vma, so always
taking that lock is not expected to introduce any scalability issues.

However, always taking the same lock does mean we only need to take one
lock, which means rmap_walk on pages from any anon_vma in the vma is
excluded from occurring during an munmap, expand_stack or other operation
that needs to exclude rmap_walk and similar functions.

Also add the proper locking to vma_adjust.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cba48b98 09-Aug-2010 Rik van Riel <riel@redhat.com>

mm: change direct call of spin_lock(anon_vma->lock) to inline function

Subsitute a direct call of spin_lock(anon_vma->lock) with an inline
function doing exactly the same.

This makes it easier to do the substitution to the root anon_vma lock in a
following patch.

We will deal with the handful of special locks (nested, dec_and_lock, etc)
separately.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac39cf8c 26-May-2010 Andrew Morton <akpm@linux-foundation.org>

memcg: fix mis-accounting of file mapped racy with migration

FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.

Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.

Now, at migrating mapped file, events happen in following sequence.

1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit for the charge.
(IOW, PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark the new page's page_cgroup
as PCG_USED.

Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", because we should avoid to trust page_cgroup->mem_cgroup.
if PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
not on LRU or page_cgroup->mem_cgroup is NULL.)

We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.

BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.

New migration sequence with memcg is:

1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, new page's page_cgroup is now makrked as "USED"
We can catch 0->1 event and FILE_MAPPED will be properly updated.

And we can catch SWAPOUT event after unlock this and freeing this
page by unmap() can be caught.

7. Clear PageCgroupMigration of the old page.

So, FILE_MAPPED will be correctly updated.

Then, for what MIGRATION flag is ?
Without it, at migration failure, we may have to charge old page again
because it may be fully unmapped. "charge" means that we have to dive into
memory reclaim or something complated. So, it's better to avoid
charge it again. Before this patch, __commit_charge() was working for
both of the old/new page and fixed up all. But this technique has some
racy condtion around FILE_MAPPED and SWAPOUT etc...
Now, the kernel use MIGRATION flag and don't uncharge old page until
the end of migration.

I hope this change will make memcg's page migration much simpler. This
page migration has caused several troubles. Worth to add a flag for
simplification.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 748446bb 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: compaction: memory compaction core

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.

[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3fe2011f 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: migration: allow the migration of PageSwapCache pages

PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies page swap caches and allows
them to be migrated but ensures that no attempt to made to remap the pages
would would potentially try to access an already freed anon_vma.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 67b9509b 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: migration: do not try to migrate unmapped anonymous pages

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being
isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
page dropped to 0 and the anon_vma gets freed. This can happen during
memory compaction if pages being migrated belong to a process that exits
before migration completes. Hence, the use-after-free race looks like

1. Page isolated for migration
2. Process exits
3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f60c214 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: migration: share the anon_vma ref counts between KSM and page migration

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f6c8272 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: migration: take a reference to the anon_vma before migrating

This patchset is a memory compaction mechanism that reduces external
fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
pageblocks. The term "compaction" was chosen as there are is a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.

In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.

This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.

It also does not try relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.

The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.

Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trgger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.

Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.

Ths first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax

The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
a) Start updatedb
b) Create in parallel a X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running

At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size

X86 vanilla compaction
Final page count 913 916 (attempted 1002)
pages reclaimed 68296 9791

X86-64 vanilla compaction
Final page count: 901 902 (attempted 1002)
Total pages reclaimed: 112599 53234

PPC64 vanilla compaction
Final page count: 93 94 (attempted 110)
Total pages reclaimed: 103216 61838

There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.

The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.

The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.

vanilla compaction
Percentage of request allocated X86 98 99
Percentage of request allocated X86-64 95 98
Percentage of request allocated PPC64 55 70

This patch:

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e13861d8 24-May-2010 Minchan Kim <minchan.kim@gmail.com>

mm: remove return value of putback_lru_pages()

putback_lru_page() never can fail. So it doesn't matter count of "the
number of pages put back".

In addition, users of this functions don't use return value.

Let's remove unnecessary code.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 85f1fb72 05-Mar-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm/migrate.c: kill anon local variable from migrate_page_copy

commit 01b1ae63c2 ("memcg: simple migration handling") removed
mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local
variable `anon' is now unused.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87b8d1ad 18-Feb-2010 H. Peter Anvin <hpa@zytor.com>

mm: Make copy_from_user() in migrate.c statically predictable

x86-32 has had a static test for copy_on_user() overflow for a while.
This test currently fails in mm/migrate.c resulting in an
allyesconfig/allmodconfig build failure on x86-32:

In function ‘copy_from_user’,
inlined from ‘do_pages_stat’ at
/home/hpa/kernel/git/mm/migrate.c:1012:
/home/hpa/kernel/git/arch/x86/include/asm/uaccess_32.h:212: error:
call to ‘copy_from_user_overflow’ declared

Make the logic more explicit and therefore easier for gcc to
understand.

v2: rewrite the loop entirely using a more normal structure for a
chunked-data loop (Linus Torvalds)

Reported-by: Len Brown <lenb@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@linux.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b3073e1 18-Dec-2009 Russell King <rmk+kernel@arm.linux.org.uk>

MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself

On VIVT ARM, when we have multiple shared mappings of the same file
in the same MM, we need to ensure that we have coherency across all
copies. We do this via make_coherent() by making the pages
uncacheable.

This used to work fine, until we allowed highmem with highpte - we
now have a page table which is mapped as required, and is not available
for modification via update_mmu_cache().

Ralf Beache suggested getting rid of the PTE value passed to
update_mmu_cache():

On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
to construct a pointer to the pte again. Passing a pte_t * is much
more elegant. Maybe we might even replace the pte argument with the
pte_t?

Ben Herrenschmidt would also like the pte pointer for PowerPC:

Passing the ptep in there is exactly what I want. I want that
-instead- of the PTE value, because I have issue on some ppc cases,
for I$/D$ coherency, where set_pte_at() may decide to mask out the
_PAGE_EXEC.

So, pass in the mapped page table pointer into update_mmu_cache(), and
remove the PTE value, updating all implementations and call sites to
suit.

Includes a fix from Stephen Rothwell:

sparc: fix fallout from update_mmu_cache API change

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>


# 6f5a55f1 05-Feb-2010 Linus Torvalds <torvalds@linux-foundation.org>

Fix potential crash with sys_move_pages

We incorrectly depended on the 'node_state/node_isset()' functions
testing the node range, rather than checking it explicitly. That's not
reliable, even if it might often happen to work. So do the proper
explicit test.

Reported-by: Marcus Meissner <meissner@suse.de>
Acked-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 418b27ef 14-Dec-2009 Lee Schermerhorn <Lee.Schermerhorn@hp.com>

mm: remove unevictable_migrate_page function

unevictable_migrate_page() in mm/internal.h is a relic of the since
removed UNEVICTABLE_LRU Kconfig option. This patch removes the function
and open codes the test in migrate_page_copy().

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62b61f61 14-Dec-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

ksm: memory hotremove migration only

The previous patch enables page migration of ksm pages, but that soon gets
into trouble: not surprising, since we're using the ksm page lock to lock
operations on its stable_node, but page migration switches the page whose
lock is to be used for that. Another layer of locking would fix it, but
do we need that yet?

Do we actually need page migration of ksm pages? Yes, memory hotremove
needs to offline sections of memory: and since we stopped allocating ksm
pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
candidates for migration.

But KSM is currently unconscious of NUMA issues, happily merging pages
from different NUMA nodes: at present the rule must be, not to use
MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
ksm pages does not make sense yet.

So, to complete support for ksm swapping we need to make hotremove safe.
ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
are freed before migration reaches them, stable_nodes may be left still
pointing to struct pages which have been removed from the system: the
stable_node needs to identify a page by pfn rather than page pointer, then
it can safely prune them when MEM_OFFLINE.

And make NUMA migration skip PageKsm pages where it skips PageReserved.
But it's only when we reach unmap_and_move() that the page lock is taken
and we can be sure that raised pagecount has prevented a PageAnon from
being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
page when offlining (has sufficient locking) but reject it otherwise.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9995ef9 14-Dec-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

ksm: rmap_walk to remove_migation_ptes

A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.

Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).

rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.

For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ca7b3c5 14-Dec-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: define PAGE_MAPPING_FLAGS

At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
in page->mapping, with the higher bits a pointer to the anon_vma; and have
defined PageKsm(page) as that with NULL anon_vma.

But KSM swapping will need to store a pointer there: so in preparation for
that, now define PAGE_MAPPING_FLAGS as the low two bits, including
PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
other use for the bit emerges).

Declare page_rmapping(page) to return the pointer part of page->mapping,
and page_anon_vma(page) to return the anon_vma pointer when that's what it
is. Use these in a few appropriate places: notably, unuse_vma() has been
testing page->mapping, but is better to be testing page_anon_vma() (cases
may be added in which flag bits are set without any pointer).

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d9c285a 14-Dec-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: move inc_zone_page_state(NR_ISOLATED) to just isolated place

Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
in right after isolate_page().

This patch does it.

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9255850 08-Dec-2009 H. Peter Anvin <hpa@zytor.com>

mm: Adjust do_pages_stat() so gcc can see copy_from_user() is safe

Slightly adjust the logic for determining the size of the
copy_form_user() in do_pages_stat(); with this change, gcc can see
that the copying is safe.

Without this, we get a build error for i386 allyesconfig:

/home/hpa/kernel/linux-2.6-tip.urgent/arch/x86/include/asm/uaccess_32.h:213:
error: call to ‘copy_from_user_overflow’ declared with attribute
error: copy_from_user() buffer size is not provably correct

Unlike an earlier patch from Arjan, this doesn't introduce new
variables; merely reshuffles the compare so that gcc can see that an
overflow cannot happen.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090926205406.30d55b08@infradead.org>


# e00e4316 11-Nov-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: fix wrong pointer initialization at page migration when memcg is disabled.

Lee Schermerhorn reported that he saw bad pointer dereference in
mem_cgroup_end_migration() when he disabled memcg by boot option.

memcg's page migration logic works as

mem_cgroup_prepare_migration(page, &ptr);
do page migration
mem_cgroup_end_migration(page, ptr);

Now, ptr is not initialized in prepare_migration when memcg is disabled
by boot option. This causes panic in end_migration. This patch fixes it.

Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# edcf4748 21-Sep-2009 Johannes Weiner <hannes@cmpxchg.org>

mm: return boolean from page_has_private()

Make page_has_private() return a true boolean value and remove the double
negations from the two callsites using it for arithmetic.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c0b1351 21-Sep-2009 Johannes Weiner <hannes@cmpxchg.org>

mm: return boolean from page_is_file_cache()

page_is_file_cache() has been used for both boolean checks and LRU
arithmetic, which was always a bit weird.

Now that page_lru_base_type() exists for LRU arithmetic, make
page_is_file_cache() a real predicate function and adjust the
boolean-using callsites to drop those pesky double negations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a731286d 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: vmstat: add isolate pages

If the system is running a heavy load of processes then concurrent reclaim
can isolate a large number of pages from the LRU. /proc/vmstat and the
output generated for an OOM do not show how many pages were isolated.

This has been observed during process fork bomb testing (mstctl11 in LTP).

This patch shows the information about isolated pages.

Reproduced via:

-----------------------
% ./hackbench 140 process 1000
=> OOM occur

active_anon:146 inactive_anon:0 isolated_anon:49245
active_file:79 inactive_file:18 isolated_file:113
unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
free:370 slab_reclaimable:309 slab_unreclaimable:5492
mapped:53 shmem:15 pagetables:28140 bounce:0

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b02108a 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: oom analysis: add shmem vmstat

Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# abfc3488 21-Sep-2009 Shaohua Li <shaohua.li@intel.com>

memory hotplug: migrate swap cache page

In test, some pages in swap-cache can't be migrated, as they aren't rmap.

unmap_and_move() ignores swap-cache page which is just read in and hasn't
rmap (see the comments in the code), but swap_aops provides .migratepage.
Better to migrate such pages instead of ignore them.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yakui Zhao <yakui.zhao@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 14fa31b8 16-Sep-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Use bitmask/action code for try_to_unmap behaviour

try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very straight
forward, because each of these flag change multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this becomes quickly unmanageable.

Replace the different flags with a action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straight forward and allows easier extension
to new behaviours. Change all the caller to declare what they want to
do.

This patch is supposed to be a nop in behaviour. If anyone can prove
it is not that would be a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 35282a2d 16-Jun-2009 Brice Goglin <Brice.Goglin@ens-lyon.org>

migration: only migrate_prep() once per move_pages()

migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
throughput by breaking it into chunks, but it also made migrate_prep() be
called once per chunk (every 128pages or so) instead of once per
move_pages().

This patch reverts to calling migrate_prep() only once per chunk as we did
before 2.6.29. It is also a followup to commit
0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
under mmap_sem").

This improves migration throughput on the above machine from 600MB/s to
750MB/s.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6484eb3e 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: do not check NUMA node ID when the caller knows the node is valid

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in
fast paths know for a fact their node is valid. To avoid a comparison and
branch, this patch adds alloc_pages_exact_node() that only checks the nid
with VM_BUG_ON(). Callers that know their node is valid are then
converted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 266cf658 03-Apr-2009 David Howells <dhowells@redhat.com>

FS-Cache: Recruit a page flags for cache management

Recruit a page flag to aid in cache management. The following extra flag is
defined:

(1) PG_fscache (PG_private_2)

The marked page is backed by a local cache and is pinning resources in the
cache driver.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>


# 1001c9fb 11-Feb-2009 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

migration: migrate_vmas should check "vma"

migrate_vmas() should check "vma" not "vma->vm_next" for for-loop condition.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 938bb9f5 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 28

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 01b1ae63 07-Jan-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: simple migration handling

Now, management of "charge" under page migration is done under following
manner. (Assume migrate page contents from oldpage to newpage)

before
- "newpage" is charged before migration.
at success.
- "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
at failure
- "newpage" is uncharged.
- "oldpage" is charged if necessary (*1)

But (*1) is not reliable....because of GFP_ATOMIC.

This patch tries to change behavior as following by charge/commit/cancel ops.

before
- charge PAGE_SIZE (no target page)
success
- commit charge against "newpage".
failure
- commit charge against "oldpage".
(PCG_USED bit works effectively to avoid double-counting)
- if "oldpage" is obsolete, cancel charge of PAGE_SIZE.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a81b88c 07-Jan-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: introduce charge-commit-cancel style of functions

There is a small race in do_swap_page(). When the page swapped-in is
charged, the mapcount can be greater than 0. But, at the same time some
process (shares it ) call unmap and make mapcount 1->0 and the page is
uncharged.

CPUA CPUB
mapcount == 1.
(1) charge if mapcount==0 zap_pte_range()
(2) mapcount 1 => 0.
(3) uncharge(). (success)
(4) set page's rmap()
mapcount 0=>1

Then, this swap page's account is leaked.

For fixing this, I added a new interface.
- charge
account to res_counter by PAGE_SIZE and try to free pages if necessary.
- commit
register page_cgroup and add to LRU if necessary.
- cancel
uncharge PAGE_SIZE because of do_swap_page failure.

CPUA
(1) charge (always)
(2) set page's rmap (mapcount > 0)
(3) commit charge was necessary or not after set_pte().

This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does charge -> commit at a time.

And this patch also adds following function to clarify all charges.

- mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
called against newly allocated anon pages.

- mem_cgroup_charge_migrate_fixup()
called only from remove_migration_ptes().
we'll have to rewrite this later.(this patch just keeps old behavior)
This function will be removed by additional patch to make migration
clearer.

Good for clarifying "what we do"

Then, we have 4 following charge points.
- newpage
- swap-in
- add-to-cache.
- migration.

[akpm@linux-foundation.org: add missing inline directives to stubs]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d91add0 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

mm: add Set,ClearPageSwapCache stubs

If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5bd1455c 06-Jan-2009 Brice Goglin <Brice.Goglin@inria.fr>

mm: move_pages: no need to set pp->page to ZERO_PAGE(0) by default

pp->page is never used when not set to the right page, so there is no need
to set it to ZERO_PAGE(0) by default.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3140a227 06-Jan-2009 Brice Goglin <Brice.Goglin@inria.fr>

mm: rework do_pages_move() to work on page_sized chunks

Rework do_pages_move() to work by page-sized chunks of struct page_to_node
that are passed to do_move_page_to_node_array(). We now only have to
allocate a single page instead a possibly very large vmalloc area to store
all page_to_node entries.

As a result, new_page_node() will now have a very small lookup, hidding
much of the overall sys_move_pages() overhead.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Nathalie Furmento <Nathalie.Furmento@labri.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c095adbc 16-Dec-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: Don't touch uninitialized variable in do_pages_stat_array()

Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary
variable initialization. As a result following warning happened:

CC mm/migrate.o
mm/migrate.c: In function 'sys_move_pages':
mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function

More unfortunately, if find_vma() failed, kernel read uninitialized
memory.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80bba129 09-Dec-2008 Brice Goglin <Brice.Goglin@inria.fr>

mm: no get_user/put_user while holding mmap_sem in do_pages_stat?

Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
gets the page address from user-space and puts the corresponding status
back while holding the mmap_sem for read. There is no need to hold
mmap_sem there while some page-faults may occur.

This patch adds a temporary address and status buffer so as to only
hold mmap_sem while working on these kernel buffers. This is
implemented by extracting do_pages_stat_array() out of do_pages_stat().

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bda8550d 19-Nov-2008 Hugh Dickins <hugh@veritas.com>

migration: fix writepage error

Page migration's writeout() has got understandably confused by the nasty
AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has
unlocked the page, so writeout() then needs to relock it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c69e8d9c 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Use RCU to access another task's creds and to release a task's own creds

Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b6dff3ec 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 76aac0e9 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap task credential accesses in the core kernel

Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-audit@redhat.com
Cc: containers@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Signed-off-by: James Morris <jmorris@namei.org>


# 0aedadf9 06-Nov-2008 Christoph Lameter <cl@linux-foundation.org>

mm: move migrate_prep out from under mmap_sem

Move the migrate_prep outside the mmap_sem for the following system calls

1. sys_move_pages
2. sys_migrate_pages
3. sys_mbind()

It really does not matter when we flush the lru. The system is free to
add pages onto the lru even during migration which will make the page
migration either skip the page (mbind, migrate_pages) or return a busy
state (move_pages).

Fixes this lockdep warning (and potential deadlock):

Some VM place has
mmap_sem -> kevent_wq via lru_add_drain_all()

net/core/dev.c::dev_ioctl() has
rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault.

linkwatch_event has
kevent_wq -> rtnl_lock

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b7abea96 18-Oct-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: make page->mapping NULL before uncharge

This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not". This patch also adds BUG_ON() to
mem_cgroup_uncharge_cache_page();

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e9a0f02 18-Oct-2008 Brice Goglin <Brice.Goglin@inria.fr>

mm: extract do_pages_move() out of sys_move_pages()

To prepare the chunking, move the sys_move_pages() code that is used when
nodes!=NULL into do_pages_move(). And rename do_move_pages() into
do_move_page_to_node_array().

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f007e74 18-Oct-2008 Brice Goglin <Brice.Goglin@inria.fr>

mm: don't vmalloc a huge page_to_node array for do_pages_stat()

do_pages_stat() does not need any page_to_node entry for real. Just pass
the pointers to the user-space page address array and to the user-space
status array, and have do_pages_stat() traverse the former and fill the
latter directly.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e78bbfa8 18-Oct-2008 Brice Goglin <Brice.Goglin@inria.fr>

mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated

A patchset reworking sys_move_pages(). It removes the possibly large
vmalloc by using multiple chunks when migrating large buffers. It also
dramatically increases the throughput for large buffers since the lookup
in new_page_node() is now limited to a single chunk, causing the quadratic
complexity to have a much slower impact. There is no need to use any
radix-tree-like structure to improve this lookup.

sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9Gz),
migrating between nodes #2 and #3:

length move_pages (us) move_pages+patch (us)
4kB 126 98
40kB 198 168
400kB 963 937
4MB 12503 11930
40MB 246867 11848

Patches #1 and #4 are the important ones:
1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
2) don't vmalloc a huge page_to_node array for do_pages_stat()
3) extract do_pages_move() out of sys_move_pages()
4) rework do_pages_move() to work on page_sized chunks
5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

This patch:

There is no point in returning -ENOENT from sys_move_pages() if all pages
were already on the right node, while we return 0 if only 1 page was not.
Most application don't know where their pages are allocated, so it's not
an error to try to migrate them anyway.

Just return 0 and let the status array in user-space be checked if the
application needs details.

It will make the upcoming chunked-move_pages() support much easier.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b291f000 18-Oct-2008 Nick Piggin <npiggin@suse.de>

mlock: mlocked pages are unevictable

Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.

Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>

splitlru: introduce __get_user_pages():

New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 894bc310 18-Oct-2008 Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Unevictable LRU Infrastructure

When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.

Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.

Kosaki Motohiro added the support for the memory controller unevictable
lru list.

Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.

The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.

A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.

To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.

[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2e18538 18-Oct-2008 Rik van Riel <riel@redhat.com>

define page_file_cache() function

Define page_file_cache() function to answer the question:
is page backed by a file?

Originally part of Rik van Riel's split-lru patch. Extracted to make
available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will be needed by
subsequent "split LRU" and "noreclaim" patches.

Unfortunately this needs to use a page flag, since the PG_swapbacked state
needs to be preserved all the way to the point where the page is last
removed from the LRU. Trying to derive the status from other info in the
page resulted in wrong VM statistics in earlier split VM patchsets.

The total number of page flags in use on a 32 bit machine after this patch
is 19.

[akpm@linux-foundation.org: fix up out-of-order merge fallout]
[hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f04e9ebb 18-Oct-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

swap: use an array for the LRU pagevecs

Turn the pagevecs into an array just like the LRUs. This significantly
cleans up the source code and reduces the size of the kernel by about 13kB
after all the LRU lists have been created further down in the split VM
patch series.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62695a84 18-Oct-2008 Nick Piggin <npiggin@suse.de>

vmscan: move isolate_lru_page() to vmscan.c

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.

This patch series improves VM scalability by:

1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory

2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number

3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are keept on the unevictable list.

This patch:

isolate_lru_page logically belongs to be in vmscan.c than migrate.c.

It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.

Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts

[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 529ae9aa 01-Aug-2008 Nick Piggin <npiggin@suse.de>

mm: rename page trylock

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19fd6231 25-Jul-2008 Nick Piggin <npiggin@suse.de>

mm: spinlock tree_lock

mapping->tree_lock has no read lockers. convert the lock from an rwlock
to a spinlock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e286781d 25-Jul-2008 Nick Piggin <npiggin@suse.de>

mm: speculative page references

If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.

This can be done if get_page/put_page behaves consistently throughout the
whole tree (ie. if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).

Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page. We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.

This patch introduces the core locking protocol to the pagecache (ie.
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).

Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.

[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69029cd5 25-Jul-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: remove refcnt from page_cgroup

memcg: performance improvements

Patch Description
1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
2/5 ... swapcache handling patch
3/5 ... add helper function for shmem's memory reclaim patch
4/5 ... optimize by likely/unlikely ppatch
5/5 ... remove redundunt check patch (shmem handling is fixed.)

Unix bench result.

== 2.6.26-rc2-mm1 + memory resource controller
Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 2915.4 678.0
File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
=========
FINAL SCORE 991.3

== 2.6.26-rc2-mm1 + this set ==
Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 3012.9 700.7
File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
=========
FINAL SCORE 1004.6

This patch:

Remove refcnt from page_cgroup().

After this,

* A page is charged only when !page_mapped() && no page_cgroup is assigned.
* Anon page is newly mapped.
* File page is added to mapping->tree.

* A page is uncharged only when
* Anon page is fully unmapped.
* File page is removed from LRU.

There is no change in behavior from user's view.

This patch also removes unnecessary calls in rmap.c which was used only for
refcnt mangement.

[akpm@linux-foundation.org: fix warning]
[hugh@veritas.com: fix shmem_unuse_inode charging]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8589cc1 25-Jul-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: better migration handling

This patch changes page migration under memory controller to use a
different algorithm. (thanks to Christoph for new idea.)

Before:
- page_cgroup is migrated from an old page to a new page.
After:
- a new page is accounted , no reuse of page_cgroup.

Pros:

- We can avoid compliated lock depndencies and races in migration.

Cons:

- new param to mem_cgroup_charge_common().

- mem_cgroup_getref() is added for handling ref_cnt ping-pong.

This version simplifies complicated lock dependency in page migraiton
under memory resource controller.

new refcnt sequence is following.

a mapped page:
prepage_migration() ..... +1 to NEW page
try_to_unmap() ..... all refs to OLD page is gone.
move_pages() ..... +1 to NEW page if page cache.
remap... ..... all refs from *map* is added to NEW one.
end_migration() ..... -1 to New page.

page's mapcount + (page_is_cache) refs are added to NEW one.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83d1674a 23-Jul-2008 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

mm: make CONFIG_MIGRATION available w/o CONFIG_NUMA

We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
support.

This patch makes CONFIG_MIGRATION selectable for architectures that define
ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
kernel won't compile because migrate_vmas() does not know about
vm_ops->migrate() and vma_migratable() does not know about policy_zone.
To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
because they are not being used w/o NUMA. vma_migratable() is moved over
from migrate.h to mempolicy.h.

[kosaki.motohiro@jp.fujitsu.com: build fix]
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: KOSAKI Motorhiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f5ca265 23-Jul-2008 Adrian Bunk <bunk@kernel.org>

mm/migrate.c should #include <linux/syscalls.h>

Every file should include the headers containing the externs for its
global functions (in this case for sys_move_pages()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cde53535 04-Jul-2008 Christoph Lameter <clameter@sgi.com>

Christoph has moved

Remove all clameter@sgi.com addresses from the kernel tree since they will
become invalid on June 27th. Change my maintainer email address for the
slab allocators to cl@linux-foundation.org (which will be the new email
address for the future).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89f5b7da 20-Jun-2008 Linus Torvalds <torvalds@linux-foundation.org>

Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a902c5f 30-Apr-2008 Nick Piggin <npiggin@suse.de>

mm: fix warning on memory offline

KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
is coming from page migration caller.

WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
Call Trace:
[<a000000100015220>] show_stack+0x80/0xa0
[<a000000100015270>] dump_stack+0x30/0x60
[<a000000100089ed0>] warn_on_slowpath+0x90/0xe0
[<a0000001001f8b10>] __set_page_dirty+0x330/0x360
[<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280
[<a00000010012fec0>] set_page_dirty+0xc0/0x260
[<a000000100195670>] migrate_page_copy+0x5d0/0x5e0
[<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0
[<a000000100195eb0>] migrate_pages+0x770/0xe00

What was happening is that migrate_page_copy wants to transfer the PG_dirty
bit from old page to new page, so what it would do is set_page_dirty(newpage).
However set_page_dirty() is used to set the entire page dirty, wheras in
this case, only part of the page was dirty, and it also was not uptodate.

Marking the whole page dirty with set_page_dirty would lead to corruption or
unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
buffers.

Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
however in the interests of keeping the change minimal...

Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98837c7f 04-Mar-2008 Hugh Dickins <hugh@veritas.com>

memcg: fix VM_BUG_ON from page migration

Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
found a swap pte, before it had determined it to be a migration entry. That
left a surplus reference count on the page_cgroup, so it was still attached
when the page was later freed.

Move that mem_cgroup_charge down to where we're sure it's a migration entry.
We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
already inappropriate: change that to GFP_ATOMIC.

It's essential that remove_migration_pte removes all the migration entries,
other crashes follow if not. So proceed even when the charge fails: normally
it cannot, but after a mem_cgroup_force_empty it might - comment in the code.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae41be37 07-Feb-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

bugfix for memory cgroup controller: migration under memory controller fix

While using memory control cgroup, page-migration under it works as following.
==
1. uncharge all refs at try to unmap.
2. charge regs again remove_migration_ptes()
==
This is simple but has following problems.
==
The page is uncharged and charged back again if *mapped*.
- This means that cgroup before migration can be different from one after
migration
- If page is not mapped but charged as page cache, charge is just ignored
(because not mapped, it will not be uncharged before migration)
This is memory leak.
==
This patch tries to keep memory cgroup at page migration by increasing
one refcnt during it. 3 functions are added.

mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
new page.

During migration
- old page is under PG_locked.
- new page is under PG_locked, too.
- both old page and new page is not on LRU.

These 3 facts guarantee that page_cgroup() migration has no race.

Tested and worked well in x86_64/fake-NUMA box.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1a1cd59 07-Feb-2008 Balbir Singh <balbir@linux.vnet.ibm.com>

Memory controller: make charging gfp mask aware

Nick Piggin pointed out that swap cache and page cache addition routines
could be called from non GFP_KERNEL contexts. This patch makes the
charging routine aware of the gfp context. Charging might fail if the
cgroup is over it's limit, in which case a suitable error is returned.

This patch was tested on a Powerpc box. I am still looking at being able
to test the path, through which allocations happen in non GFP_KERNEL
contexts.

[kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a9f3ccd 07-Feb-2008 Balbir Singh <balbir@linux.vnet.ibm.com>

Memory controller: memory accounting

Add the accounting hooks. The accounting is carried out for RSS and Page
Cache (unmapped) pages. There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time. Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache(). Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier. A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup. Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: <Valdis.Kletnieks@vt.edu>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62e1c553 04-Feb-2008 Shaohua Li <shaohua.li@intel.com>

page migraton: handle orphaned pages

Orphaned page might have fs-private metadata, the page is truncated. As
the page hasn't mapping, page migration refuse to migrate the page. It
appears the page is only freed in page reclaim and if zone watermark is
low, the page is never freed, as a result migration always fail. I thought
we could free the metadata so such page can be freed in migration and make
migration more reliable.

[akpm@linux-foundation.org: go direct to try_to_free_buffers()]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 698dd4ba 04-Feb-2008 Matt Mackall <mpm@selenic.com>

maps4: move is_swap_pte

Move is_swap_pte helper function to swapops.h for use by pagemap code

Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9534b3f 19-Oct-2007 Gabriel Craciunescu <nix.or.die@googlemail.com>

Typo fixes retrun -> return

Typo fixes retrun -> return

Signed-off-by: Gabriel Craciunescu <nix.or.die@googlemail.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>


# 228ebcbe 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

Uninline find_task_by_xxx set of functions

The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.

It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.

This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b488893a 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

pid namespaces: changes to show virtual ids to user

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 954ffcb3 16-Oct-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

flush icache before set_pte() on ia64: flush icache at set_pte

Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
add modfied set_pte() for flushing if necessary.

This patch flush icache of a page when
new pte has exec bit.
&& new pte has present bit
&& new pte is user's page.
&& (old *ptep is not present
|| new pte's pfn is not same to old *ptep's ptn)
&& new pte's page has no Pg_arch_1 bit.
Pg_arch_1 is set when a page is cache consistent.

I think this condition checks are much easier to understand than considering
"Where sync_icache_dcache() should be inserted ?".

pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
clean-up. So, I added it again.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97ee0524 16-Oct-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

flush cache before installing new page at migraton

In migration, a new page should be cache flushed before set_pte() in some
archs which have virtually-tagged cache.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56bbd65d 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: Update memory policy and page migration

Online nodes now may have no memory. The checks and initialization must
therefore be changed to no longer use the online functions.

This will correctly initialize the interleave on bootup to only target nodes
with memory and will make sys_move_pages return an error when a page is to be
moved to a memoryless node. Similarly we will get an error if MPOL_BIND and
MPOL_INTERLEAVE is used on a memoryless node.

These are somewhat new semantics. So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange like with MPOL_INTERLEAVE). If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d966d49 14-Oct-2007 Al Viro <viro@ftp.linux.org.uk>

mm/migrate.c __user annotation

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 989f89c5 31-Aug-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

fix rcu_read_lock() in page migraton

In migration fallback path, write_page() or lock_page() will be called.
This causes sleep with holding rcu_read_lock().
For avoding that, just do rcu_lock if the page is Anon.(this is enough.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3dd9fe8c 26-Jul-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory unplug: isolate_lru_page fix

release_pages() in mm/swap.c changes page_count() to be 0 without removing
PageLRU flag...

This means isolate_lru_page() can see a page, PageLRU() &&
page_count(page)==0.. This is BUG. (get_page() will be called against
count=0 page.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc386d4d 26-Jul-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory unplug: migration by kernel

In usual, migrate_pages(page,,) is called with holding mm->sem by system call.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.

But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.

1. A page which page->mapping==NULL can be target of migration. Then, we have
to check page->mapping before calling try_to_unmap().

2. anon_vma can be freed while page is unmapped, but page->mapping remains as
it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from
being freed during migration.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 769848c0 17-Jul-2007 Mel Gorman <mel@csn.ul.ie>

Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated

It is often known at allocation time whether a page may be migrated or not.
This patch adds a flag called __GFP_MOVABLE and a new mask called
GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated
using the page migration mechanism or reclaimed by syncing with backing
storage and discarding.

An API function very similar to alloc_zeroed_user_highpage() is added for
__GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
flags used by alloc_zeroed_user_highpage() are not changed because it would
change the semantics of an existing API. After this patch is applied there
are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
be marked deprecated if this patch is merged.

Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
shmem.c to keep all flag modifications to inode->mapping in the
shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
Hugh Dickens.

Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
concept. Credit to Hugh Dickens for catching issues with shmem swap vector
and ramfs allocations.

[akpm@linux-foundation.org: build fix]
[hugh@veritas.com: __GFP_ZERO cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e8c7d0f 23-Apr-2007 Christoph Lameter <clameter@sgi.com>

page migration: fix NR_FILE_PAGES accounting

NR_FILE_PAGES must be accounted for depending on the zone that the page
belongs to. If we replace the page in the radix tree then we may have to
shift the count to another zone.

Suggested-by: Ethan Solomita <solo@google.com>
Eventually-typed-in-by: Christoph Lameter <clameter@sgi.com>
Cc: Martin Bligh <mbligh@mbligh.org>
Cc: <stable@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0dc952dc 05-Mar-2007 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] Page migration: Fix vma flag checking

Currently we do not check for vma flags if sys_move_pages is called to move
individual pages. If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.

Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that is can be used from both locations.

Problem was spotted by Lee Schermerhorn

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cf9c2c7 06-Dec-2006 Nick Piggin <npiggin@suse.de>

[PATCH] radix-tree: RCU lockless readside

Make radix tree lookups safe to be performed without locks. Readers are
protected against nodes being deleted by using RCU based freeing. Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.

Each radix tree node keeps a record of their height (above leaf nodes).
This height does not change after insertion -- when the radix tree is
extended, higher nodes are only inserted in the top. So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.

"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.

When a reader wants to traverse the next branch, they will take a copy of
the pointer. This pointer will be either NULL (and the branch is empty) or
non-NULL (and will point to a valid node).

[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8ce08464 02-Nov-2006 Stephen Rothwell <sfr@canb.auug.org.au>

[PATCH] Fix sys_move_pages when a NULL node list is passed

sys_move_pages() uses vmalloc() to allocate an array of structures that is
fills with information passed from user mode and then passes to
do_stat_pages() (in the case the node list is NULL). do_stat_pages()
depends on a marker in the node field of the structure to decide how large
the array is and this marker is correctly inserted into the last element of
the array. However, vmalloc() doesn't zero the memory it allocates and if
the user passes NULL for the node list, then the node fields are not filled
in (except for the end marker). If the memory the vmalloc() returned
happend to have a word with the marker value in it in just the right place,
do_pages_stat will fail to fill the status field of part of the array and
we will return (random) kernel data to user mode.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9361401e 30-Sep-2006 David Howells <dhowells@redhat.com>

[PATCH] BLOCK: Make it possible to disable the block layer [try #6]

Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.

(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:

(*) Block I/O tracing.

(*) Disk partition code.

(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.

(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.

(*) MTD blockdev handling and FTL.

(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.

(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.

(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.

(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.

(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:

(*) Default blockdev file operations (to give error ENODEV on opening).

(*) Makes some /proc changes:

(*) /proc/devices does not list any blockdevs.

(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.

(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).

(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b398f6bf 29-Aug-2006 David Howells <dhowells@redhat.com>

[PATCH] BLOCK: Stop fallback_migrate_page() from using page_has_buffers() [try #6]

Stop fallback_migrate_page() from using page_has_buffers() since that might not
be available. Use PagePrivate() instead since that's more general.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 980128f2 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Define easier to handle GFP_THISNODE

In many places we will need to use the same combination of flags. Specify
a single GFP_THISNODE definition for ease of use in gfp.h.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3d99cfb5 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] sys_move_pages: Do not fall back to other nodes

If the user specified a node where we should move the page to then we
really do not want any other node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e6a1530d 25-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Allow migration of mlocked pages

Hugh clarified the role of VM_LOCKED. So we can now implement page
migration for mlocked pages.

Allow the migration of mlocked pages. This means that try_to_unmap must
unmap mlocked pages in the migration case.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7b2259b3 25-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: Support a vma migration function

Hooks for calling vma specific migration functions

With this patch a vma may define a vma->vm_ops->migrate function. That
function may perform page migration on its own (some vmas may not contain page
structs and therefore cannot be handled by regular page migration. Pages in a
vma may require special preparatory treatment before migration is possible
etc) . Only mmap_sem is held when the migration function is called. The
migrate() function gets passed two sets of nodemasks describing the source and
the target of the migration. The flags parameter either contains

MPOL_MF_MOVE which means that only pages used exclusively by
the specified mm should be moved

or

MPOL_MF_MOVE_ALL which means that pages shared with other processes
should also be moved.

The migration function returns 0 on success or an error condition. An error
condition will prevent regular page migration from occurring.

On its own this patch cannot be included since there are no users for this
functionality. But it seems that the uncached allocator will need this
functionality at some point.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 86c3a764 23-Jun-2006 David Quigley <dpquigl@tycho.nsa.gov>

[PATCH] SELinux: add security_task_movememory calls to mm code

This patch inserts security_task_movememory hook calls into memory management
code to enable security modules to mediate this operation between tasks.

Since the last posting, the hook has been renamed following feedback from
Christoph Lameter.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 742755a1 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: sys_move_pages(): support moving of individual pages

move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
addresses_of_pages[],
nodes[] or NULL,
status[],
flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES The page is now on the indicated node.

-ENOENT Page is not present

-EACCES Page is mapped by multiple processes and can only
be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM The page has been mlocked by a process/driver and
cannot be moved.

-EBUSY Page is busy and cannot be moved. Try again later.

-EFAULT Invalid address (no VMA or zero page).

-ENOMEM Unable to allocate memory on target node.

-EIO Unable to write back page. The page must be written
back in order to move it since the page is dirty and the
filesystem does not provide a migration function that
would allow the moving of dirty pages.

-EINVAL A dirty page cannot be moved. The filesystem does not provide
a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT No pages found that would require moving. All pages
are either already on the target node, not present, had an
invalid address or could not be moved because they were
mapped by multiple processes.

-EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
to migrate pages in a kernel thread.

-EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges.
or an attempt to move a process belonging to another user.

-EACCES One of the target nodes is not allowed by the current cpuset.

-ENODEV One of the target nodes is not online.

-ESRCH Process does not exist.

-E2BIG Too many pages to move.

-ENOMEM Not enough memory to allocate control array.

-EFAULT Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

Detailed results for sys_move_pages()

Pass a pointer to an integer to get_new_page() that may be used to
indicate where the completion status of a migration operation should be
placed. This allows sys_move_pags() to report back exactly what happened to
each page.

Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 95a402c3 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: use allocator function for migrate_pages()

Instead of passing a list of new pages, pass a function to allocate a new
page. This allows the correct placement of MPOL_INTERLEAVE pages during page
migration. It also further simplifies the callers of migrate pages.
migrate_pages() becomes similar to migrate_pages_to() so drop
migrate_pages_to(). The batching of new page allocations becomes unnecessary.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# aaa994b3 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: handle freeing of pages in migrate_pages()

Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e24f0b8f 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: simplify migrate_pages()

Currently migrate_pages() is mess with lots of goto. Extract two functions
from migrate_pages() and get rid of the gotos.

Plus we can just unconditionally set the locked bit on the new page since we
are the only one holding a reference. Locking is to stop others from
accessing the page once we establish references to the new page.

Remove the list_del from move_to_lru in order to have finer control over list
processing.

[akpm@osdl.org: add debug check]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 04e62a29 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] More page migration: use migration entries for file pages

This implements the use of migration entries to preserve ptes of file backed
pages during migration. Processes can therefore be migrated back and forth
without loosing their connection to pagecache pages.

Note that we implement the migration entries only for linear mappings.
Nonlinear mappings still require the unmapping of the ptes for migration.

And another writepage() ugliness shows up. writepage() can drop the page
lock. Therefore we have to remove migration ptes before calling writepages()
in order to avoid having migration entries point to unlocked pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 442c9137 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] More page migration: do not inc/dec rss counters

If we install a migration entry then the rss not really decreases since the
page is just moved somewhere else. We can save ourselves the work of
decrementing and later incrementing which will just eventually cause cacheline
bouncing.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6c5240ae 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Swapless page migration: modify core logic

Use the migration entries for page migration

This modifies the migration code to use the new migration entries. It now
becomes possible to migrate anonymous pages without having to add a swap
entry.

We add a couple of new functions to replace migration entries with the proper
ptes.

We cannot take the tree_lock for migrating anonymous pages anymore. However,
we know that we hold the only remaining reference to the page when the page
count reaches 1.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d75a0fcd 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Swapless page migration: rip out swap based logic

Rip the page migration logic out.

Remove all code that has to do with swapping during page migration.

This also guts the ability to migrate pages to swap. No one used that so lets
let it go for good.

Page migration should be a bit broken after this patch.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0697212a 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Swapless page migration: add R/W migration entries

Implement read/write migration ptes

We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.

The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.

We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.

Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c

From: Hugh Dickins <hugh@veritas.com>

Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.

This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.

Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)

This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.

(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)

From: Ingo Molnar <mingo@elte.hu>

pte_unmap_unlock() takes the pte pointer as an argument.

From: Hugh Dickins <hugh@veritas.com>

Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.

The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.

Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.

From: Hugh Dickins <hugh@veritas.com>

Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8351a6e4 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: move fallback handling into special function

Move the fallback code into a new fallback function and make the function
behave like any other migration function. This requires retaking the lock if
pageout() drops it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2d1db3b1 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: pass "mapping" to migration functions

Change handling of address spaces.

Pass a pointer to the address space in which the page is migrated to all
migration function. This avoids repeatedly having to retrieve the address
space pointer from the page and checking it for validity. The old page
mapping will change once migration has gone to a certain step, so it is less
confusing to have the pointer always available.

Move the setting of the mapping and index for the new page into
migrate_pages().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c3fcf8a5 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: extract try_to_unmap from migration functions

Extract try_to_unmap and rename remove_references -> move_mapping

try_to_unmap() may significantly change the page state by for example setting
the dirty bit. It is therefore best to unmap in migrate_pages() before
calling any migration functions.

migrate_page_remove_references() will then only move the new page in place of
the old page in the mapping. Rename the function to
migrate_page_move_mapping().

This allows us to get rid of the special unmapping for the fallback path.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5b5c7120 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: drop nr_refs in remove_references()

Drop nr_refs parameter from migrate_page_remove_references()

The nr_refs parameter is not really useful since the number of remaining
references is always

1 for anonymous pages without a mapping
2 for pages with a mapping
3 for pages with a mapping and PagePrivate set.

Remove the early check for the number of references since we are checking
page_mapcount() earlier. Ultimately only the refcount matters after the
tree_lock has been obtained.

Signed-off-by: Christoph Lameter <clameter@sgi.coim>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e7340f73 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: remove useless definitions

Remove the export for migrate_page_remove_references() and migrate_page_copy()
that are unlikely to be used directly by filesystems implementing migration.
The export was useful when buffer_migrate_page() lived in fs/buffer.c but it
has now been moved to migrate.c in the migration reorg.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1d8b85cc 23-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration cleanup: group functions

Reorder functions in migrate.c. Group all migration functions for struct
address_space_operations together.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4c28f811 01-May-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration: Fix fallback behavior for dirty pages

Currently we check PageDirty() in order to make the decision to swap out
the page. However, the dirty information may be only be contained in the
ptes pointing to the page. We need to first unmap the ptes before checking
for PageDirty(). If unmap is successful then the page count of the page
will also be decreased so that pageout() works properly.

This is a fix necessary for 2.6.17. Without this fix we may migrate dirty
pages for filesystems without migration functions. Filesystems may keep
pointers to dirty pages. Migration of dirty pages can result in the
filesystem keeping pointers to freed pages.

Unmapping is currently not be separated out from removing all the
references to a page and moving the mapping. Therefore try_to_unmap will
be called again in migrate_page() if the writeout is successful. However,
it wont do anything since the ptes are already removed.

The coming updates to the page migration code will restructure the code
so that this is no longer necessary.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 64a3ca5f 10-Apr-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] mm/migrate.c: don't export a static function

EXPORT_SYMBOL'ing of a static function is not a good idea.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e23ca00b 10-Apr-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Some page migration fixups

- Remove sparse comment

- Remove duplicated include

- Return the correct error condition in migrate_page_remove_references().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b20a3503 22-Mar-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] page migration reorg

Centralize the page migration functions in anticipation of additional
tinkering. Creates a new file mm/migrate.c

1. Extract buffer_migrate_page() from fs/buffer.c

2. Extract central migration code from vmscan.c

3. Extract some components from mempolicy.c

4. Export pageout() and remove_from_swap() from vmscan.c

5. Make it possible to configure NUMA systems without page migration
and non-NUMA systems with page migration.

I had to so some #ifdeffing in mempolicy.c that may need a cleanup.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>