History log of /linux-master/mm/memory-failure.c
Revision Date Author Comments
# 1983184c 07-Apr-2024 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

When I did hard offline test with hugetlb pages, below deadlock occurs:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
__mutex_lock+0x6c/0x770
page_alloc_cpu_online+0x3c/0x70
cpuhp_invoke_callback+0x397/0x5f0
__cpuhp_invoke_callback_range+0x71/0xe0
_cpu_up+0xeb/0x210
cpu_up+0x91/0xe0
cpuhp_bringup_mask+0x49/0xb0
bringup_nonboot_cpus+0xb7/0xe0
smp_init+0x25/0xa0
kernel_init_freeable+0x15f/0x3e0
kernel_init+0x15/0x1b0
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(pcp_batch_high_lock);
lock(cpu_hotplug_lock);
lock(pcp_batch_high_lock);
rlock(cpu_hotplug_lock);

*** DEADLOCK ***

5 locks held by bash/46904:
#0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
#1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
#2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
#3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
#4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
check_noncircular+0x129/0x140
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, below scene breaks the lock dependency chain:

memory_failure
__page_handle_poison
zone_pcp_disable -- lock(pcp_batch_high_lock)
dissolve_free_huge_page
__hugetlb_vmemmap_restore_folio
static_key_slow_dec
cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.

This issue won't occur until commit a6b40850c442 ("mm: hugetlb: replace
hugetlb_free_vmemmap_enabled with a static_key"). As it introduced
rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while
lock(pcp_batch_high_lock) is already in the __page_handle_poison().

[linmiaohe@huawei.com: extend comment per Oscar]
[akpm@linux-foundation.org: reflow block comment]
Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2fde9e7f 24-Jan-2024 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

When I did soft offline stress test, a machine was observed to crash with
the following message:

kernel BUG at include/linux/memcontrol.h:554!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:folio_memcg+0xaf/0xd0
Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
Call Trace:
<TASK>
split_huge_page_to_list+0x4d/0x1380
try_to_split_thp_page+0x3a/0xf0
soft_offline_page+0x1ea/0x8a0
soft_offline_page_store+0x52/0x90
kernfs_fop_write_iter+0x118/0x1b0
vfs_write+0x30b/0x430
ksys_write+0x5e/0xe0
do_syscall_64+0xb0/0x1b0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f6c60d14697
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00

The problem is that page->mapping is overloaded with slab->slab_list or
slabs fields now, so slab pages could be taken as non-LRU movable pages if
field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set to
LIST_POISON2. These slab pages will be treated as thp later leading to
crash in split_huge_page_to_list().

Link: https://lkml.kernel.org/r/20240126065837.2100184-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20240124084014.1772906-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 19d3e221 12-Jan-2024 Sidhartha Kumar <sidhartha.kumar@oracle.com>

fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling

has_extra_refcount() makes the assumption that the page cache adds a ref
count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1
(mm/filemap: remove hugetlb special casing in filemap.c) modifies
__filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases
(including hugtetlb) where nr is the number of pages in the folio. We
should adjust the number of references coming from the page cache by
subtracing the number of pages rather than 1.

In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong
flag as, in the hugetlb case, memory-failure code calls
folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is
the correct function to test for that flag.

After these fixes, the hugetlb hwpoison read selftest passes all cases.

Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com
Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519
Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: James Houghton <jthoughton@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org> [6.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4d8f7418 20-Dec-2023 David Hildenbrand <david@redhat.com>

mm/rmap: remove page_remove_rmap()

All callers are gone, let's remove it and some leftover traces.

Link: https://lkml.kernel.org/r/20231220224504.646757-33-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# af7628d6 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: convert error_remove_page to error_remove_folio

There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e130b651 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

memory-failure: convert truncate_error_page to truncate_error_folio

Both callers now have a folio, so pass it in. Nothing downstream was
expecting a tail page; that's asserted in generic_error_remove_page(), for
example.

Link: https://lkml.kernel.org/r/20231117161447.2461643-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b6fd410c 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

memory-failure: use a folio in me_huge_page()

This function was already explicitly calling compound_head();
unfortunately the compiler can't know that and elide the redundant calls
to compound_head() buried in page_mapping(), unlock_page(), etc. Switch
to using a folio, which does let us elide these calls.

Link: https://lkml.kernel.org/r/20231117161447.2461643-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f7092393 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

memory-failure: convert delete_from_lru_cache() to take a folio

All three callers now have a folio; pass it in instead of the page.
Saves five calls to compound_head().

Link: https://lkml.kernel.org/r/20231117161447.2461643-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6304b531 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

memory-failure: use a folio in me_pagecache_dirty()

Replaces three hidden calls to compound_head() with one visible one.

Link: https://lkml.kernel.org/r/20231117161447.2461643-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3d47e317 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

memory-failure: use a folio in me_pagecache_clean()

Patch series "Convert aops->error_remove_page to ->error_remove_folio".

This is a memory-failure patch series which converts a lot of uses of page
APIs into folio APIs with the usual benefits.


This patch (of 6):

Replaces three hidden calls to compound_head() with one visible one.
Fix up a few comments while I'm modifying this function.

Link: https://lkml.kernel.org/r/20231117161447.2461643-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231117161447.2461643-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 761d79fb 08-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert isolate_page() to mf_isolate_folio()

The only caller now has a folio, so pass it in and operate on it. Saves
many page->folio conversions and introduces only one folio->page
conversion when calling isolate_movable_page().

Link: https://lkml.kernel.org/r/20231108182809.602073-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 049b2604 08-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert soft_offline_in_use_page() to use a folio

Replace the existing head-page logic with folio logic.

Link: https://lkml.kernel.org/r/20231108182809.602073-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 19369d86 08-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: use mapping_evict_folio() in truncate_error_page()

We already have the folio and the mapping, so replace the call to
invalidate_inode_page() with mapping_evict_folio().

Link: https://lkml.kernel.org/r/20231108182809.602073-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fa422b35 23-Oct-2023 Shiyang Ruan <ruansy.fnst@fujitsu.com>

mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind

Now, if we suddenly remove a PMEM device(by calling unbind) which
contains FSDAX while programs are still accessing data in this device,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
it could come into an unacceptable state:
1. device has gone but mount point still exists, and umount will fail
with "target is busy"
2. programs will hang and cannot be killed
3. may crash with NULL pointer dereference

To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
are going to remove the whole device, and make sure all related processes
could be notified so that they could end up gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.

Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
new dax mapping from being created. Do not shutdown filesystem directly
if configuration is not supported, or if failure range includes metadata
area. Make sure all files and processes(not only the current progress)
are handled correctly. Also drop the cache of associated files before
pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>


# 39ebd6dc 18-Dec-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memory-failure: cast index to loff_t before shifting it

On 32-bit systems, we'll lose the top bits of index because arithmetic
will be performed in unsigned long instead of unsigned long long. This
affects files over 4GB in size.

Link: https://lkml.kernel.org/r/20231218135837.3310403-4-willy@infradead.org
Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c79c5a0a 18-Dec-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memory-failure: check the mapcount of the precise page

A process may map only some of the pages in a folio, and might be missed
if it maps the poisoned page but not the head page. Or it might be
unnecessarily hit if it maps the head page, but not the poisoned page.

Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org
Fixes: 7af446a841a2 ("HWPOISON, hugetlb: enable error handling path for hugepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 376907f3 18-Dec-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memory-failure: pass the folio and the page to collect_procs()

Patch series "Three memory-failure fixes".

I've been looking at the memory-failure code and I believe I have found
three bugs that need fixing -- one going all the way back to 2010! I'll
have more patches later to use folios more extensively but didn't want
these bugfixes to get caught up in that.


This patch (of 3):

Both collect_procs_anon() and collect_procs_file() iterate over the VMA
interval trees looking for a single pgoff, so it is wrong to look for the
pgoff of the head page as is currently done. However, it is also wrong to
look at page->mapping of the precise page as this is invalid for tail
pages. Clear up the confusion by passing both the folio and the precise
page to collect_procs().

Link: https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231218135837.3310403-2-willy@infradead.org
Fixes: 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 91e79d22 22-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert DAX lock/unlock page to lock/unlock folio

The one caller of DAX lock/unlock page already calls compound_head(), so
use page_folio() instead, then use a folio throughout the DAX code to
remove uses of page->mapping and page->index.

[jane.chu@oracle.com: add comment to mf_generic_kill_procss(), simplify mf_generic_kill_procs:folio initialization]
Link: https://lkml.kernel.org/r/20230908222336.186313-1-jane.chu@oracle.com
Link: https://lkml.kernel.org/r/20230822231314.349200-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d256d1cd 27-Aug-2023 Tong Tiangen <tongtiangen@huawei.com>

mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()

We found a softlock issue in our test, analyzed the logs, and found that
the relevant CPU call trace as follows:

CPU0:
_do_fork
-> copy_process()
-> write_lock_irq(&tasklist_lock) //Disable irq,waiting for
//tasklist_lock

CPU1:
wp_page_copy()
->pte_offset_map_lock()
-> spin_lock(&page->ptl); //Hold page->ptl
-> ptep_clear_flush()
-> flush_tlb_others() ...
-> smp_call_function_many()
-> arch_send_call_function_ipi_mask()
-> csd_lock_wait() //Waiting for other CPUs respond
//IPI

CPU2:
collect_procs_anon()
-> read_lock(&tasklist_lock) //Hold tasklist_lock
->for_each_process(tsk)
-> page_mapped_in_vma()
-> page_vma_mapped_walk()
-> map_pte()
->spin_lock(&page->ptl) //Waiting for page->ptl

We can see that CPU1 waiting for CPU0 respond IPI,CPU0 waiting for CPU2
unlock tasklist_lock, CPU2 waiting for CPU1 unlock page->ptl. As a result,
softlockup is triggered.

For collect_procs_anon(), what we're doing is task list iteration, during
the iteration, with the help of call_rcu(), the task_struct object is freed
only after one or more grace periods elapse. the logic as follows:

release_task()
-> __exit_signal()
-> __unhash_process()
-> list_del_rcu()

-> put_task_struct_rcu_user()
-> call_rcu(&task->rcu, delayed_put_task_struct)

delayed_put_task_struct()
-> put_task_struct()
-> if (refcount_sub_and_test())
__put_task_struct()
-> free_task()

Therefore, under the protection of the rcu lock, we can safely use
get_task_struct() to ensure a safe reference to task_struct during the
iteration.

By removing the use of tasklist_lock in task list iteration, we can break
the softlock chain above.

The same logic can also be applied to:
- collect_procs_file()
- collect_procs_fsdax()
- collect_procs_ksm()

Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6885938c 13-Jul-2023 Jiaqi Yan <jiaqiyan@google.com>

mm/hwpoison: rename hwp_walk* to hwpoison_walk*

In the discussion of "Improve hugetlbfs read on HWPOISON hugepages" [1],
Matthew Wilcox suggests hwp is a bad abbreviation of hwpoison, as hwp is
already used as "an acronym by acpi, intel_pstate, some clock drivers, an
ethernet driver, and a scsi driver"[1].

So rename hwp_walk and hwp_walk_ops to hwpoison_walk and
hwpoison_walk_ops respectively.

raw_hwp_(page|list), *_raw_hwp, and raw_hwp_unreliable flag are other
major appearances of "hwp". However, given the "raw" hint in the name, it
is easy to differentiate them from other "hwp" acronyms. Since renaming
them is not as straightforward as renaming hwp_walk*, they are not covered
by this commit.

[1] https://lore.kernel.org/lkml/20230707201904.953262-5-jiaqiyan@google.com/T/#me6fecb8ce1ad4d5769199c9e162a44bc88f7bdec

Link: https://lkml.kernel.org/r/20230713235553.4121855-1-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7a8817f2 27-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: add PageOffline() check

Memory failure is not interested in logically offlined pages. Skip this
type of page.

Link: https://lkml.kernel.org/r/20230727115643.639741-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d51b6846 01-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: fix potential page refcnt leak in memory_failure()

put_ref_page() is not called to drop extra refcnt when comes from madvise
in the case pfn is valid but pgmap is NULL leading to page refcnt leak.

Link: https://lkml.kernel.org/r/20230701072837.1994253-1-linmiaohe@huawei.com
Fixes: 1e8aaedb182d ("mm,memory_failure: always pin the page in madvise_inject_error")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6379693e 07-Aug-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: use helper macro llist_for_each_entry_safe()

It's more convenient to use helper macro llist_for_each_entry_safe().
No functional change intended.

Link: https://lkml.kernel.org/r/20230807114125.3440802-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b79f8eb4 12-Jul-2023 Jiaqi Yan <jiaqiyan@google.com>

mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON

Add the functionality, is_raw_hwpoison_page_in_hugepage, to tell if a raw
page in a hugetlb folio is HWPOISON. This functionality relies on
RawHwpUnreliable to be not set; otherwise hugepage's raw HWPOISON list
becomes meaningless.

is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
with folio_set_hugetlb_hwpoison and folio_free_raw_hwp who iterate,
insert, or delete entry in raw_hwp_list. llist itself doesn't ensure
insertion and removal are synchornized with the llist_for_each_entry used
by is_raw_hwpoison_page_in_hugepage (unless iterated entries are already
deleted from the list). Caller can minimize the overhead of lock cycles
by first checking HWPOISON flag of the folio.

Exports this functionality to be immediately used in the read operation
for hugetlbfs.

Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9e130c4b 12-Jul-2023 Jiaqi Yan <jiaqiyan@google.com>

mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp

Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.

Today when hardware memory is corrupted in a hugetlb hugepage, kernel
leaves the hugepage in pagecache [1]; otherwise future mmap or read will
suject to silent data corruption. This is implemented by returning -EIO
from hugetlb_read_iter immediately if the hugepage has HWPOISON flag set.

Since memory_failure already tracks the raw HWPOISON subpages in a
hugepage, a natural improvement is possible: if userspace only asks for
healthy subpages in the pagecache, kernel can return these data.

This patchset implements this improvement. It consist of three parts.
The 1st commit exports the functionality to tell if a subpage inside a
hugetlb hugepage is a raw HWPOISON page. The 2nd commit teaches
hugetlbfs_read_iter to return as many healthy bytes as possible. The 3rd
commit properly tests this new feature.

[1] commit 8625147cafaa ("hugetlbfs: don't delete error page from pagecache")


This patch (of 4):

Traversal on llist (e.g. llist_for_each_safe) is only safe AFTER entries
are deleted from the llist. Correct the way __folio_free_raw_hwp deletes
and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
kfree within llist_for_each_safe.

As of today, concurrent adding, deleting, and traversal on raw_hwp_list
from hugetlb.c and/or memory-failure.c are fine with each other. Note
this is guaranteed partly by the lock-free nature of llist, and partly by
holding hugetlb_lock and/or mf_mutex. For example, as llist_del_all is
lock-free with itself, folio_clear_hugetlb_hwpoison()s from
__update_and_free_hugetlb_folio and memory_failure won't need explicit
locking when freeing the raw_hwp_list. New code that manipulates
raw_hwp_list must be careful to ensure the concurrency correctness.

Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d31155b8 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: fix race window when trying to get hugetlb folio

page_folio() is fetched before calling get_hwpoison_hugetlb_folio()
without hugetlb_lock being held. So hugetlb page could be demoted before
get_hwpoison_hugetlb_folio() holding hugetlb_lock but after page_folio()
is fetched. So get_hwpoison_hugetlb_folio() will hold unexpected extra
refcnt of hugetlb folio while leaving demoted page un-refcnted.

Link: https://lkml.kernel.org/r/20230711055016.2286677-9-linmiaohe@huawei.com
Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a363d122 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: fetch compound head after extra page refcnt is held

Page might become thp, huge page or being splited after compound head is
fetched but before page refcnt is bumped. So hpage might be a tail page
leading to VM_BUG_ON_PAGE(PageTail(page)) in PageTransHuge().

Link: https://lkml.kernel.org/r/20230711055016.2286677-8-linmiaohe@huawei.com
Fixes: 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5885c6a6 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: minor cleanup for comments and codestyle

Fix some wrong function names and grammar error in comments. Also remove
unneeded space after for_each_process. No functional change intended.

Link: https://lkml.kernel.org/r/20230711055016.2286677-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e9c36f7a 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: remove unneeded header files

Remove some unneeded header files. No functional change intended.

Link: https://lkml.kernel.org/r/20230711055016.2286677-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 55c7ac45 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: use local variable huge to check hugetlb page

Use local variable huge to check whether page is hugetlb page to avoid
calling PageHuge() multiple times to save cpu cycles. PageHuge() will be
stable while extra page refcnt is held.

Link: https://lkml.kernel.org/r/20230711055016.2286677-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 80ee7cb2 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: don't account hwpoison_filter() filtered pages

mf_generic_kill_procs() will return -EOPNOTSUPP when hwpoison_filter()
filtered dax page. In that case, action_result() isn't expected to be
called to update mf_stats. This will results in inaccurate but benign
memory failure handling statistics.

Link: https://lkml.kernel.org/r/20230711055016.2286677-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 92a025a7 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: ensure moving HWPoison flag to the raw error pages

If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
called from try_memory_failure_hugetlb() won't transfer HWPoison flag to
subpages while folio's HWPoison flag is cleared. So when trying to free
this hugetlb page into buddy, folio_clear_hugetlb_hwpoison() is not called
to move HWPoison flag from head page to the raw error pages even if now
hugetlb_vmemmap_optimized is cleared. This will results in HWPoisoned
page being used again and raw_hwp_page leak.

Link: https://lkml.kernel.org/r/20230711055016.2286677-3-linmiaohe@huawei.com
Fixes: ac5fcde0a96a ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dbe70dbb 10-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: remove unneeded PageHuge() check

Patch series "A few fixup and cleanup patches for memory-failure", v2.

This series contains a few fixup patches to fix inaccurate mf_stats, fix
race window when trying to get hugetlb folio and so on. Also there is
minor cleanup for comments and codestyle. More details can be found in
the respective changelogs.


This patch (of 8):

PageHuge() check in me_huge_page() is just for potential problems. Remove
it as it's actually dead code and won't catch anything.

Link: https://lkml.kernel.org/r/20230711055016.2286677-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20230711055016.2286677-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0201ebf2 28-Jun-2023 David Howells <dhowells@redhat.com>

mm: merge folio_has_private()/filemap_release_folio() call pairs

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.


This patch (of 2):

Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vscan.c that this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1a7d018d 26-Jun-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: remove unneeded 'inline' annotation

Remove unneeded 'inline' annotation from num_poisoned_pages_inc() and
num_poisoned_pages_sub(). No functional change intended.

Link: https://lkml.kernel.org/r/20230626114343.1846587-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b7b618da 27-Jun-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: remove unneeded page state check in shake_page()

Remove unneeded PageLRU(p) and is_free_buddy_page(p) check as slab caches
are not shrunk now. This check can be added back when a lightweight range
based shrinker is available.

Link: https://lkml.kernel.org/r/20230628014929.3441386-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e2c1ab07 27-Jun-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: fix unexpected return value in soft_offline_page()

When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.

Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.

Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 49b06385 04-Aug-2023 Suren Baghdasaryan <surenb@google.com>

mm: enable page walking API to lock vmas during the walk

walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.

The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.

A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change a concurrent page
can be faulted into the area and be left out of migration.

Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# faeb2ff2 27-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: avoid false hwpoison page mapped error info

folio->_mapcount is overloaded in SLAB, so folio_mapped() has to be done
after folio_test_slab() is checked. Otherwise slab folio might be treated
as a mapped folio leading to false 'Someone maps the hwpoison page' error
info.

Link: https://lkml.kernel.org/r/20230727115643.639741-4-linmiaohe@huawei.com
Fixes: 230ac719c500 ("mm/hwpoison: don't try to unpoison containment-failed pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f29623e4 27-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: memory-failure: fix potential unexpected return value from unpoison_memory()

If unpoison_memory() fails to clear page hwpoisoned flag, return value ret
is expected to be -EBUSY. But when get_hwpoison_page() returns 1 and
fails to clear page hwpoisoned flag due to races, return value will be
unexpected 1 leading to users being confused. And there's a code smell
that the variable "ret" is used not only to save the return value of
unpoison_memory(), but also the return value from get_hwpoison_page().
Make a further cleanup by using another auto-variable solely to save the
return value of get_hwpoison_page() as suggested by Naoya.

Link: https://lkml.kernel.org/r/20230727115643.639741-3-linmiaohe@huawei.com
Fixes: bf181c582588 ("mm/hwpoison: fix unpoison_memory()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6c54312f 17-Jul-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: fix hardware poison check in unpoison_memory()

It was pointed out[1] that using folio_test_hwpoison() is wrong as we need
to check the indiviual page that has poison. folio_test_hwpoison() only
checks the head page so go back to using PageHWPoison().

User-visible effects include existing hwpoison-inject tests possibly
failing as unpoisoning a single subpage could lead to unpoisoning an
entire folio. Memory unpoisoning could also not work as expected as
the function will break early due to only checking the head page and
not the actually poisoned subpage.

[1]: https://lore.kernel.org/lkml/ZLIbZygG7LqSI9xe@casper.infradead.org/

Link: https://lkml.kernel.org/r/20230717181812.167757-1-sidhartha.kumar@oracle.com
Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c33c7948 12-Jun-2023 Ryan Roberts <ryan.roberts@arm.com>

mm: ptep_get() conversion

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 04dee9e8 08-Jun-2023 Hugh Dickins <hughd@google.com>

mm/various: give up if pte_offset_map[_lock]() fails

Following the examples of nearby code, various functions can just give up
if pte_offset_map() or pte_offset_map_lock() fails. And there's no need
for a preliminary pmd_trans_unstable() or other such check, since such
cases are now safely handled inside.

Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 97de10a9 08-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: move sysctl register in memory_failure_init()

There is already a memory_failure_init(), don't add a new initcall, move
register_sysctl_init() into it to cleanup a bit.

Link: https://lkml.kernel.org/r/20230508114128.37081-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4248d008 13-Apr-2023 Longlong Xia <xialonglong1@huawei.com>

mm: ksm: support hwpoison for ksm page

hwpoison_user_mappings() is updated to support ksm pages, and add
collect_procs_ksm() to collect processes when the error hit an ksm page.
The difference from collect_procs_anon() is that it also needs to traverse
the rmap-item list on the stable node of the ksm page. At the same time,
add_to_kill_ksm() is added to handle ksm pages. And
task_in_to_kill_list() is added to avoid duplicate addition of tsk to the
to_kill list. This is because when scanning the list, if the pages that
make up the ksm page all come from the same process, they may be added
repeatedly.

Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4f775086 13-Apr-2023 Longlong Xia <xialonglong1@huawei.com>

mm: memory-failure: refactor add_to_kill()

Patch series "mm: ksm: support hwpoison for ksm page", v2.

Currently, ksm does not support hwpoison. As ksm is being used more
widely for deduplication at the system level, container level, and process
level, supporting hwpoison for ksm has become increasingly important.
However, ksm pages were not processed by hwpoison in 2009 [1].

The main method of implementation:

1. Refactor add_to_kill() and add new add_to_kill_*() to better
accommodate the handling of different types of pages.

2. Add collect_procs_ksm() to collect processes when the error hit an
ksm page.

3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to
the to_kill list.

4. Try_to_unmap ksm page (already supported).

5. Handle related processes such as sending SIGBUS.

Tested with poisoning to ksm page from
1) different process
2) one process

and with/without memory_failure_early_kill set, the processes are killed
as expected with the patchset.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?h=01e00f880ca700376e1845cf7a2524ebe68e47d6


This patch (of 2):

The page_address_in_vma() is used to find the user virtual address of page
in add_to_kill(), but it doesn't support ksm due to the ksm page->index
unusable, add an ksm_addr as parameter to add_to_kill(), let's the caller
to pass it, also rename the function to __add_to_kill(), and adding
add_to_kill_anon_file() for handling anonymous pages and file pages,
adding add_to_kill_fsdax() for handling fsdax pages.

Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei.com
Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei.com
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8cbc82f3 20-Mar-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: Move memory failure sysctls to its own file

The sysctl_memory_failure_early_kill and memory_failure_recovery
are only used in memory-failure.c, move them to its own file.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
[mcgrof: fix by adding empty ctl entry, this caused a crash]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 611b9fd8 12-Mar-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT)

It's more clear and simple to just use IS_ENABLED(CONFIG_HWPOISON_INJECT)
to check whether or not to enable HWPoison injector module instead of
CONFIG_HWPOISON_INJECT/CONFIG_HWPOISON_INJECT_MODULE.

Link: https://lkml.kernel.org/r/20230313053929.84607-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6da6b1d4 21-Feb-2023 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON

After a memory error happens on a clean folio, a process unexpectedly
receives SIGBUS when it accesses the error page. This SIGBUS killing is
pointless and simply degrades the level of RAS of the system, because the
clean folio can be dropped without any data lost on memory error handling
as we do for a clean pagecache.

When memory_failure() is called on a clean folio, try_to_unmap() is called
twice (one from split_huge_page() and one from hwpoison_user_mappings()).
The root cause of the issue is that pte conversion to hwpoisoned entry is
now done in the first call of try_to_unmap() because PageHWPoison is
already set at this point, while it's actually expected to be done in the
second call. This behavior disturbs the error handling operation like
removing pagecache, which results in the malfunction described above.

So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
when we really intend to convert pte to hwpoison entry. This can prevent
other callers of try_to_unmap() from accidentally converting to hwpoison
entries.

Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# cd775580 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: change to return bool for isolate_movable_page()

Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
will care about the negative return value, thus we can convert the
isolate_movable_page() to return a boolean value to make the code more
clear when checking the movable page isolation state.

No functional changes intended.

[akpm@linux-foundation.org: remove unneeded comment, per Matthew]
Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9747b9e9 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: hugetlb: change to return bool for isolate_hugetlb()

Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
care about the negative value, thus we can convert the isolate_hugetlb()
to return a boolean value to make code more clear when checking the
hugetlb isolation state. Moreover converts 2 users which will consider
the negative value returned by isolate_hugetlb().

No functional changes intended.

[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f7f9c00d 15-Feb-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: change to return bool for isolate_lru_page()

The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
care about the negative error of isolate_lru_page(), except one user in
add_page_for_migration(). So we can convert the isolate_lru_page() to
return a boolean value, which can help to make the code more clear when
checking the return value of isolate_lru_page().

Also convert all users' logic of checking the isolation state.

No functional changes intended.

Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6aa3a920 13-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert isolate_hugetlb to folios

Patch series "continue hugetlb folio conversion", v3.

This series continues the conversion of core hugetlb functions to use
folios. This series converts many helper funtions in the hugetlb fault
path. This is in preparation for another series to convert the hugetlb
fault code paths to operate on folios.


This patch (of 8):

Convert isolate_hugetlb() to take in a folio and convert its callers to
pass a folio. Use page_folio() to convert the callers to use a folio is
safe as isolate_hugetlb() operates on a head page.

Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 18f41fa6 19-Jan-2023 Jiaqi Yan <jiaqiyan@google.com>

mm: memory-failure: bump memory failure stats to pglist_data

Right before memory_failure finishes its handling, accumulate poisoned
page's resolution counters to pglist_data's memory_failure_stats, so as to
update the corresponding sysfs entries.

Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access poisoned memory buffer and recover from SIGBUS
5) Check counter values under
/sys/devices/system/node/node*/memory_failure/*

Link: https://lkml.kernel.org/r/20230120034622.2698268-3-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 44b8f8bf 19-Jan-2023 Jiaqi Yan <jiaqiyan@google.com>

mm: memory-failure: add memory failure stats to sysfs

Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators. The statistics include
2 categories:

* Memory error statistics, for example, how many memory error are
encountered, how many of them are recovered by the kernel. Note these
memory errors are non-fatal to kernel: during the machine check
exception (MCE) handling kernel already classified MCE's severity to be
unnecessary to panic (but either action required or optional).

* Scanner statistics, for example how many times the scanner have fully
scanned a NUMA node, how many errors are first detected by the scanner.

The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.

Motivation
==========

Memory error stats are important to userspace but insufficient in kernel
today. Datacenter administrators can better monitor a machine's memory
health with the visible stats. For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer. Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action.
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.

Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory
error detector. Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO). Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner. Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector. It exposes the following per NUMA node memory error
counters:

/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log. The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.

The 1st commit introduces these sysfs entries. The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery. The 3rd commit adds documentations for introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6


This patch (of 3):

Today kernel provides following memory error info to userspace, but each
has its own disadvantage

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
not per NUMA node stats though

* ras:memory_failure_event: only available after explicitly enabled

* /dev/mcelog provides many useful info about the MCEs, but
doesn't capture how memory_failure recovered memory MCEs

* kernel logs: userspace needs to process log text

Exposes per NUMA node memory error stats as sysfs entries:

/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. The following math
holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 04bac040 18-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/hugetlb: convert get_hwpoison_huge_page() to folios

Straightforward conversion of get_hwpoison_huge_page() to
get_hwpoison_hugetlb_folio(). Reduces two references to a head page in
memory-failure.c

[arnd@arndb.de: fix get_hwpoison_hugetlb_folio() stub]
Link: https://lkml.kernel.org/r/20230119111920.635260-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20230118174039.14247-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e0650a41 16-Jan-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: clean up mlock_page / munlock_page references in comments

Change documentation and comments that refer to now-renamed functions.

Link: https://lkml.kernel.org/r/20230116192827.2146732-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a6fddef4 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert unpoison_memory() to folios

Use a folio inside unpoison_memory which replaces a compound_head() call
with a call to page_folio().

Link: https://lkml.kernel.org/r/20230112204608.80136-9-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 595dd818 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert hugetlb_set_page_hwpoison() to folios

Change hugetlb_set_page_hwpoison() to folio_set_hugetlb_hwpoison() and use
a folio internally.

Link: https://lkml.kernel.org/r/20230112204608.80136-8-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0858b5eb 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert __free_raw_hwp_pages() to folios

Change __free_raw_hwp_pages() to __folio_free_raw_hwp() and modify its
callers to pass in a folio.

Link: https://lkml.kernel.org/r/20230112204608.80136-7-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b02e7582 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert raw_hwp_list_head() to folios

Change raw_hwp_list_head() to take in a folio and modify its callers to
pass in a folio. Also converts two users of hugetlb specific page macro
users to their folio equivalents.

Link: https://lkml.kernel.org/r/20230112204608.80136-6-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9637d7df 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert free_raw_hwp_pages() to folios

Change free_raw_hwp_pages() to folio_free_raw_hwp(), converts two users of
hugetlb specific page macro users to their folio equivalents.

Link: https://lkml.kernel.org/r/20230112204608.80136-5-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2ff6cece 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert hugetlb_clear_page_hwpoison to folios

Change hugetlb_clear_page_hwpoison() to folio_clear_hugetlb_hwpoison() by
changing the function to take in a folio. This converts one use of
ClearPageHWPoison and HPageRawHwpUnreliable to their folio equivalents.

Link: https://lkml.kernel.org/r/20230112204608.80136-4-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bc1cfde1 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert try_memory_failure_hugetlb() to folios

Use a struct folio rather than a head page in try_memory_failure_hugetlb.
This converts one user of SetHPageMigratable to the folio equivalent.

Link: https://lkml.kernel.org/r/20230112204608.80136-3-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4c110ec9 12-Jan-2023 Sidhartha Kumar <sidhartha.kumar@oracle.com>

mm/memory-failure: convert __get_huge_page_for_hwpoison() to folios

Patch series "convert hugepage memory failure functions to folios".

This series contains a 1:1 straightforward page to folio conversion for
memory failure functions which deal with huge pages. I renamed a few
functions to fit with how other folio operating functions are named.
These include:

hugetlb_clear_page_hwpoison -> folio_clear_hugetlb_hwpoison
free_raw_hwp_pages -> folio_free_raw_hwp
__free_raw_hwp_pages -> __folio_free_raw_hwp
hugetlb_set_page_hwpoison -> folio_set_hugetlb_hwpoison

The goal of this series was to reduce users of the hugetlb specific page
flag macros which take in a page so users are protected by the compiler to
make sure they are operating on a head page.


This patch (of 8):

Use a folio throughout the function rather than using a head page. This
also reduces the users of the page version of hugetlb specific page flags.

Link: https://lkml.kernel.org/r/20230112204608.80136-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 799fb82a 03-Jan-2023 SeongJae Park <sj@kernel.org>

tools/vm: rename tools/vm to tools/mm

Rename tools/vm to tools/mm for being more consistent with the code and
documentation directories, and won't be confused with virtual machines.

Link: https://lkml.kernel.org/r/20230103180754.129637-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e0ff4280 24-Nov-2022 Ma Wupeng <mawupeng1@huawei.com>

mm/memory-failure.c: cleanup in unpoison_memory

If freeit is true, the value of ret must be zero, there is no need to
check the value of freeit after label unlock_mutex.

We can drop variable freeit to do this cleanup.

Link: https://lkml.kernel.org/r/20221125065444.3462681-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ac5efa78 18-Nov-2022 Vishal Moola (Oracle) <vishal.moola@gmail.com>

memory-failure: convert truncate_error_page() to use folio

Replace try_to_release_page() with filemap_release_folio(). This change
is in preparation for the removal of the try_to_release_page() wrapper.

Link: https://lkml.kernel.org/r/20221118073055.55694-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dad6a5eb 02-Nov-2022 Hugh Dickins <hughd@google.com>

mm,hugetlb: use folio fields in second tail page

Patch series "mm,huge,rmap: unify and speed up compound mapcounts".


This patch (of 3):

We want to declare one more int in the first tail of a compound page: that
first tail page being valuable property, since every compound page has a
first tail, but perhaps no more than that.

No problem on 64-bit: there is already space for it. No problem with
32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always
available") kindly cleared the space for it, apparently not realizing that
only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail
page->private might conflict) - but make sure of that in its Kconfig.

But hugetlb pages use tail page->private of the first tail page for a
subpool pointer, which will conflict; and they also use page->private of
the 2nd, 3rd and 4th tails.

Undo "mm: add private field of first tail to struct page and struct
folio"'s recent addition of private_1 to the folio tail: instead add
hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to
a second tail page of the folio: THP has long been using several fields of
that tail, so make better use of it for hugetlb too. This is not how a
generic folio should be declared in future, but it is an effective
transitional way to make use of it.

Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.

[hughd@google.com: prefix folio's page_1 and page_2 with double underscore,
give folio's _flags_2 and _head_2 a line documentation each]
Link: https://lkml.kernel.org/r/9e2cb6b-5b58-d3f2-b5ee-5f8a14e8f10@google.com
Link: https://lkml.kernel.org/r/5f52de70-975-e94f-f141-543765736181@google.com
Link: https://lkml.kernel.org/r/3818cc9a-9999-d064-d778-9c94c5911e6@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5033091d 24-Oct-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: introduce per-memory_block hwpoison counter

Currently PageHWPoison flag does not behave well when experiencing memory
hotremove/hotplug. Any data field in struct page is unreliable when the
associated memory is offlined, and the current mechanism can't tell
whether a memory block is onlined because a new memory devices is
installed or because previous failed offline operations are undone.
Especially if there's a hwpoisoned memory, it's unclear what the best
option is.

So introduce a new mechanism to make struct memory_block remember that a
memory block has hwpoisoned memory inside it. And make any online event
fail if the onlining memory block contains hwpoison. struct memory_block
is freed and reallocated over ACPI-based hotremove/hotplug, but not over
sysfs-based hotremove/hotplug. So the new counter can distinguish these
cases.

Link: https://lkml.kernel.org/r/20221024062012.1520887-5-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a46c9304 24-Oct-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: pass pfn to num_poisoned_pages_*()

No functional change.

Link: https://lkml.kernel.org/r/20221024062012.1520887-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d027122d 24-Oct-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c

These interfaces will be used by drivers/base/memory.c by later patch, so
as a preparatory work move them to more common header file visible to the
file.

Link: https://lkml.kernel.org/r/20221024062012.1520887-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e591ef7d 24-Oct-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage

Patch series "mm, hwpoison: improve handling workload related to hugetlb
and memory_hotplug", v7.

This patchset tries to solve the issue among memory_hotplug, hugetlb and hwpoison.
In this patchset, memory hotplug handles hwpoison pages like below:

- hwpoison pages should not prevent memory hotremove,
- memory block with hwpoison pages should not be onlined.


This patch (of 4):

HWPoisoned page is not supposed to be accessed once marked, but currently
such accesses can happen during memory hotremove because
do_migrate_range() can be called before dissolve_free_huge_pages() is
called.

Clear HPageMigratable for hwpoisoned hugepages to prevent them from being
migrated. This should be done in hugetlb_lock to avoid race against
isolate_hugetlb().

get_hwpoison_huge_page() needs to have a flag to show it's called from
unpoison to take refcount of hwpoisoned hugepages, so add it.

[naoya.horiguchi@linux.dev: remove TestClearHPageMigratable and reduce to test and clear separately]
Link: https://lkml.kernel.org/r/20221025053559.GA2104800@ik1-406-35019.vs.sakura.ne.jp
Link: https://lkml.kernel.org/r/20221024062012.1520887-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20221024062012.1520887-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b66d00df 21-Oct-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: make action_result() return int

Check mf_result in action_result(), only return 0 when MF_RECOVERED,
or return -EBUSY, which will simplify code a bit.

[wangkefeng.wang@huawei.com: v2]
Link: https://lkml.kernel.org/r/20221024035138.99119-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20221021084611.53765-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 183a7c5d 21-Oct-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: avoid pfn_valid() twice in soft_offline_page()

Simplify WARN_ON_ONCE(flags & MF_COUNT_INCREASED) under !pfn_valid().

Link: https://lkml.kernel.org/r/20221021084611.53765-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b5f1fc98 21-Oct-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: make put_ref_page() more useful

Pass pfn/flags to put_ref_page(), then check MF_COUNT_INCREASED and drop
refcount to make the code look cleaner.

Link: https://lkml.kernel.org/r/20221021084611.53765-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8625147c 18-Oct-2022 James Houghton <jthoughton@google.com>

hugetlbfs: don't delete error page from pagecache

This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache. That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned. As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache. If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses. For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.

[1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0c826c0b 02-Sep-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

rmap: remove page_unlock_anon_vma_read()

This was simply an alias for anon_vma_unlock_read() since 2011.

Link: https://lkml.kernel.org/r/20220902194653.1739778-56-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0d206b5d 10-Aug-2022 Peter Xu <peterx@redhat.com>

mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

We've got a bunch of special swap entries that stores PFN inside the swap
offset fields. To fetch the PFN, normally the user just calls
swp_offset() assuming that'll be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
max possible length of a PFN on the host, meanwhile doing proper check
with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().

One reason to do so is we never tried to sanitize whether swap offset can
really fit for storing PFN. At the meantime, this patch also prepares us
with the future possibility to store more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
any more very soon.

Replace many of the swp_offset() callers to use swp_offset_pfn() where
proper. Note that many of the existing users are not candidates for the
replacement, e.g.:

(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we wanna keep the whole swp_offset but only change the swp type.

For the latter, it can happen when fork() triggered on a write-migration
swap entry pte, we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching PFN" but "changing
swap type only". They're left aside so that when there're more
information within the swp offset they'll be carried over naturally in
those cases.

Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.

Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9cf28191 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: cleanup some obsolete comments

1.Remove meaningless comment in kill_proc(). That doesn't tell anything.
2.Fix the wrong function name get_hwpoison_unless_zero(). It should be
get_page_unless_zero().
3.The gate keeper for free hwpoison page has moved to check_new_page().
Update the corresponding comment.

Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b680dae9 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings()

PageTable can't be handled by memory_failure(). Filter it out explicitly in
hwpoison_user_mappings(). This will also make code more consistent with the
relevant check in unpoison_memory().

Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 36537a67 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon()

If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as
add_to_kill() won't be called in this case. Move up the mm check to avoid
possible unneeded calling to page_mapped_in_vma().

Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 21c9e90a 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages

Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
num_poisoned_pages_dec() can be killed as there's no caller now.

Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# da294991 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: use __PageMovable() to detect non-lru movable pages

It's more recommended to use __PageMovable() to detect non-lru movable
pages. We can avoid bumping page refcnt via isolate_movable_page() for
the isolated lru pages. Also if pages become PageLRU just after they're
checked but before trying to isolate them, isolate_lru_page() will be
called to do the right work.

[linmiaohe@huawei.com: fixes per Naoya Horiguchi]
Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2fe62e22 30-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: use ClearPageHWPoison() in memory_failure()

Patch series "A few cleanup patches for memory-failure".

his series contains a few cleanup patches to use __PageMovable() to detect
non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple
atomic ops overheads and so on. More details can be found in the
respective changelogs.


This patch (of 6):

Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
hwpoison flags to avoid unneeded full memory barrier overhead.

Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 48309e1f 18-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: kill __soft_offline_page()

Squash the __soft_offline_page() into soft_offline_in_use_page() and kill
__soft_offline_page().

[wangkefeng.wang@huawei.com: update hpage when try_to_split_thp_page() succeeds]
Link: https://lkml.kernel.org/r/20220830104654.28234-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20220819033402.156519-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7adb4588 18-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: kill soft_offline_free_page()

Open-code the page_handle_poison() into soft_offline_page() and kill
unneeded soft_offline_free_page().

Link: https://lkml.kernel.org/r/20220819033402.156519-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e9ff3ba7 18-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: avoid trying to unpoison reserved page

For reserved pages, HWPoison flag will be set without increasing the page
refcnt. So we shouldn't even try to unpoison these pages and thus
decrease the page refcnt unexpectly. Add a PageReserved() check to filter
this case out and remove the below unneeded zero page (zero page is
reserved) check.

Link: https://lkml.kernel.org/r/20220818130016.45313-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0792a4a6 18-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: kill procs if unmap fails

If try_to_unmap() fails, the hwpoisoned page still resides in the address
space of some processes. We should kill these processes or the hwpoisoned
page might be consumed later. collect_procs() is always called to collect
relevant processes now so they can be killed later if unmap fails.

[linmiaohe@huawei.com: v2]
Link: https://lkml.kernel.org/r/20220823032346.4260-6-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220818130016.45313-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 54f9555d 22-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs()

After kill_procs(), tk will be freed without being removed from the
to_kill list. In the next iteration, the freed list entry in the to_kill
list will be accessed, thus leading to use-after-free issue. Adding
list_del() in kill_procs() to fix the issue.

Link: https://lkml.kernel.org/r/20220823032346.4260-5-linmiaohe@huawei.com
Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 12f1dbcf 18-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: fix extra put_page() in soft_offline_page()

When hwpoison_filter() refuses to soft offline a page, the page refcnt
incremented previously by MF_COUNT_INCREASED would have been consumed via
get_hwpoison_page() if ret <= 0. So the put_ref_page() here will put the
extra one. Remove it to fix the issue.

Link: https://lkml.kernel.org/r/20220818130016.45313-4-linmiaohe@huawei.com
Fixes: 9113eaf331bf ("mm/memory-failure.c: add hwpoison_filter for soft offline")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6bbabd04 18-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: fix page refcnt leaking in unpoison_memory()

When free_raw_hwp_pages() fails its work, the refcnt of the hugetlb page
would have been incremented if ret > 0. Using put_page() to fix refcnt
leaking in this case.

Link: https://lkml.kernel.org/r/20220818130016.45313-3-linmiaohe@huawei.com
Fixes: debb6b9c3fdd ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f36a5543 18-Aug-2022 Miaohe Lin <linmiaohe@huawei.com>

mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()

Patch series "A few fixup patches for memory-failure", v2.

This series contains a few fixup patches to fix incorrect update of page
refcnt, fix possible use-after-free issue and so on. More details can be
found in the respective changelogs.

This patch (of 6):

When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of
the page would have been incremented if res == 1. Using put_page() to fix
the refcnt leaking in this case.

Link: https://lkml.kernel.org/r/20220823032346.4260-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220818130016.45313-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220818130016.45313-2-linmiaohe@huawei.com
Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2ace36f0 09-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: cleanup try_to_split_thp_page()

Since commit 5d1fd5dc877b ("mm,hwpoison: introduce MF_MSG_UNSPLIT_THP"),
the action_result(,MF_MSG_UNSPLIT_THP,) called to show memory error event
in memory_failure(), so the pr_info() in try_to_split_thp_page() is only
needed in soft_offline_in_use_page().

Meanwhile this could also fix the unexpected prefix for "thp split failed"
due to commit 96f96763de26 ("mm: memory-failure: convert to pr_fmt()").

Link: https://lkml.kernel.org/r/20220809111813.139690-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 77677cdb 14-Sep-2022 Shuai Xue <xueshuai@linux.alibaba.com>

mm,hwpoison: check mm when killing accessing process

The GHES code calls memory_failure_queue() from IRQ context to queue work
into workqueue and schedule it on the current CPU. Then the work is
processed in memory_failure_work_func() by kworker and calls
memory_failure().

When a page is already poisoned, commit a3f5d80ea401 ("mm,hwpoison: send
SIGBUS with error virutal address") make memory_failure() call
kill_accessing_process() that:

- holds mmap locking of current->mm
- does pagetable walk to find the error virtual address
- and sends SIGBUS to the current process with error info.

However, the mm of kworker is not valid, resulting in a null-pointer
dereference. So check mm when killing the accessing process.

[akpm@linux-foundation.org: remove unrelated whitespace alteration]
Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ac87ca0e 26-Aug-2022 Dan Williams <dan.j.williams@intel.com>

mm/memory-failure: fall back to vma_address() when ->notify_failure() fails

In the case where a filesystem is polled to take over the memory failure
and receives -EOPNOTSUPP it indicates that page->index and page->mapping
are valid for reverse mapping the failure address. Introduce
FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from
mf_dax_kill_procs() by a filesytem vs the typical memory_failure() path.

Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which
then trips this failing signature:

kernel BUG at mm/memory-failure.c:319!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G OE N 6.0.0-rc2+ #62
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:add_to_kill.cold+0x19d/0x209
[..]
Call Trace:
<TASK>
collect_procs.part.0+0x2c4/0x460
memory_failure+0x71b/0xba0
? _printk+0x58/0x73
do_madvise.part.0.cold+0xaf/0xc5

Link: https://lkml.kernel.org/r/166153429427.2758201.14605968329933175594.stgit@dwillia2-xfh.jf.intel.com
Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 65d3440e 26-Aug-2022 Dan Williams <dan.j.williams@intel.com>

mm/memory-failure: fix detection of memory_failure() handlers

Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax) do not even
have pagemap ops which results in crash signatures like this:

BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 22 PID: 4535 Comm: device-dax Tainted: G OE N 6.0.0-rc2+ #59
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:memory_failure+0x667/0xba0
[..]
Call Trace:
<TASK>
? _printk+0x58/0x73
do_madvise.part.0.cold+0xaf/0xc5

Check for ops before checking if the ops have a memory_failure()
handler.

Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com
Fixes: 33a8f7f2b3a3 ("pagemap,pmem: introduce ->memory_failure()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6f461488 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: enable memory error handling on 1GB hugepage

Now error handling code is prepared, so remove the blocking code and
enable memory error handling on 1GB hugepage.

Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ceaf8fbe 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage

Currently if memory_failure() (modified to remove blocking code with
subsequent patch) is called on a page in some 1GB hugepage, memory error
handling fails and the raw error page gets into leaked state. The impact
is small in production systems (just leaked single 4kB page), but this
limits the testability because unpoison doesn't work for it. We can no
longer create 1GB hugepage on the 1GB physical address range with such
leaked pages, that's not useful when testing on small systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
broken down into raw error pages before coming to this point:

if (unlikely(PageHWPoison(page)) && !order) {
...
return false;
}

Then, the page is not sent to buddy and the page refcount is left 0.

Originally this check is supposed to work when the error page is freed
from page_handle_poison() (that is called from soft-offline), but now we
are opening another path to call it, so the callers of
__page_handle_poison() need to handle the case by considering the return
value 0 as success. Then page refcount for hwpoison is properly
incremented so unpoison works.

Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7453bf62 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: make __page_handle_poison returns int

__page_handle_poison() returns bool that shows whether
take_page_off_buddy() has passed or not now. But we will want to
distinguish another case of "dissolve has passed but taking off failed" by
its return value. So change the type of the return value. No functional
change.

Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 38f6d293 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: set PG_hwpoison for busy hugetlb pages

If memory_failure() fails to grab page refcount on a hugetlb page because
it's busy, it returns without setting PG_hwpoison on it. This not only
loses a chance of error containment, but breaks the rule that
action_result() should be called only when memory_failure() do any of
handling work (even if that's just setting PG_hwpoison). This
inconsistency could harm code maintainability.

So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.

Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev
Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ac5fcde0 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage

Raw error info list needs to be removed when hwpoisoned hugetlb is
unpoisoned. And unpoison handler needs to know how many errors there are
in the target hugepage. So add them.

HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage)) sometimes
can't be unpoisoned, so skip them.

Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 161df60e 13-Jul-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison, hugetlb: support saving mechanism of raw error pages

When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages. If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolve sometimes fails, then the error page is left as
hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
healthy pages, but that's not possible now because the information about
where the raw error pages is lost.

Use the private field of a few tail pages to keep that information. The
code path of shrinking hugepage pool uses this info to try delayed
dissolve. In order to remember multiple errors in a hugepage, a
singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
constructed. Only simple operations (adding an entry or clearing all) are
required and the list is assumed not to be very long, so this simple data
structure should be enough.

If we failed to save raw error info, the hwpoison hugepage has errors on
unknown subpage, then this new saving mechanism does not work any more, so
disable saving new raw error info and freeing hwpoison hugepages.

Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 96f96763 26-Jul-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: memory-failure: convert to pr_fmt()

Use pr_fmt to prefix all pr_<level> output, but unpoison_memory() and
soft_offline_page() are used by error injection, which have own prefixes
like "Unpoison:" and "soft offline:", meanwhile, soft_offline_page() could
be used by memory hotremove, so reset pr_fmt before unpoison_pr_info
definition to keep the original output for them.

[wangkefeng.wang@huawei.com: v3]
Link: https://lkml.kernel.org/r/20220729031919.72331-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20220726081046.10742-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c36e2024 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

mm: introduce mf_dax_kill_procs() for fsdax case

This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing a
DAX mapping. It is intended to be called by the file systems as part of
the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.

Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 33a8f7f2 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

pagemap,pmem: introduce ->memory_failure()

When memory-failure occurs, we call this function which is implemented by
each kind of devices. For the fsdax case, pmem device driver implements
it. Pmem device driver will find out the filesystem in which the
corrupted page located in.

With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers. If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.

Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 00cc790e 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

mm: factor helpers for memory_failure_dev_pagemap

memory_failure_dev_pagemap code is a bit complex before introduce RMAP
feature for fsdax. So it is needed to factor some helper functions to
simplify these code.

[akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build]
[zhengbin13@huawei.com: fix redefinition of mf_generic_kill_procs]
Link: https://lkml.kernel.org/r/20220628112143.1170473-1-zhengbin13@huawei.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f25cbb7a 15-Jul-2022 Alex Sierra <alex.sierra@amd.com>

mm: add zone device coherent type memory support

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI or
CXL). Any page of a process can be migrated to such memory. However, no
one should be allowed to pin such memory so that it can always be evicted.

[hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page]
Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 75fa68a5 17-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/swap: convert delete_from_swap_cache() to take a folio

All but one caller already has a folio, so convert it to use a folio.

Link: https://lkml.kernel.org/r/20220617175020.717127-22-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7ce82f4c 30-May-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/migration: return errno when isolate_huge_page failed

We might fail to isolate huge page due to e.g. the page is under
migration which cleared HPageMigratable. We should return errno in this
case rather than always return 1 which could confuse the user, i.e. the
caller might think all of the memory is migrated while the hugetlb page is
left behind. We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.

Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6ffcd825 28-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Remove __delete_from_page_cache()

This wrapper is no longer used. Remove it and all references to it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 67f22ba7 15-Jun-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure: disable unpoison once hw error happens

Currently unpoison_memory(unsigned long pfn) is designed for soft
poison(hwpoison-inject) only. Since 17fae1294ad9d, the KPTE gets cleared
on a x86 platform once hardware memory corrupts.

Unpoisoning a hardware corrupted page puts page back buddy only, the
kernel has a chance to access the page with *NOT PRESENT* KPTE. This
leads BUG during accessing on the corrupted KPTE.

Suggested by David&Naoya, disable unpoison mechanism when a real HW error
happens to avoid BUG like this:

Unpoison: Software-unpoisoned page 0x61234
BUG: unable to handle page fault for address: ffff888061234000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0.bm.1-amd64 #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
RIP: 0010:clear_page_erms+0x7/0x10
Code: ...
RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
FS: 00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
prep_new_page+0x151/0x170
get_page_from_freelist+0xca0/0xe20
? sysvec_apic_timer_interrupt+0xab/0xc0
? asm_sysvec_apic_timer_interrupt+0x1b/0x20
__alloc_pages+0x17e/0x340
__folio_alloc+0x17/0x40
vma_alloc_folio+0x84/0x280
__handle_mm_fault+0x8d4/0xeb0
handle_mm_fault+0xd5/0x2a0
do_user_addr_fault+0x1d0/0x680
? kvm_read_and_reset_apf_flags+0x3b/0x50
exc_page_fault+0x78/0x170
asm_exc_page_fault+0x27/0x30

Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
Fixes: 847ce401df392 ("HWPOISON: Add unpoisoning support")
Fixes: 17fae1294ad9d ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6d4675e6 19-May-2022 Minchan Kim <minchan@kernel.org>

mm: don't be stuck to rmap lock on reclaim path

The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap). It makes reclaim path stuck. In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.

This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.

Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).

In a heavy app workload in Android, trace shows following statistics. It
almost removes rmap lock contention from reclaim path.

Martin Liu reported:

Before:

max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read
601 0 601 145.544681 28817 198 rmap_walk_file

After:

max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
NaN NaN NaN NaN NaN 0.0 NaN
0 0 0 0.127645 1 12 rmap_walk_file

[minchan@kernel.org: add comment, per Matthew]
Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e240ac52 12-May-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure.c: simplify num_poisoned_pages_inc/dec

Originally, do num_poisoned_pages_inc() in memory failure routine, use
num_poisoned_pages_dec() to rollback the number if filtered/ cancelled.

Suggested by Naoya, do num_poisoned_pages_inc() only in action_result(),
this make this clear and simple.

Link: https://lkml.kernel.org/r/20220509105641.491313-6-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9113eaf3 12-May-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure.c: add hwpoison_filter for soft offline

hwpoison_filter is missing in the soft offline path, this leads an issue:
after enabling the corrupt filter, the user process still has a chance to
inject hwpoison fault by madvise(addr, len, MADV_SOFT_OFFLINE) at PFN
which is expected to reject.

Also do a minor change in comment of memory_failure().

Link: https://lkml.kernel.org/r/20220509105641.491313-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c8bd84f7 12-May-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure.c: simplify num_poisoned_pages_dec

Don't decrease the number of poisoned pages in page_alloc.c, let the
memory-failure.c do inc/dec poisoned pages only.

Also simplify unpoison_memory(), only decrease the number of
poisoned pages when:
- TestClearPageHWPoison() succeed
- put_page_back_buddy succeed

After decreasing, print necessary log.

Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 60f272f6 12-May-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure.c: move clear_hwpoisoned_pages

Patch series "memory-failure: fix hwpoison_filter", v2.

As well known, the memory failure mechanism handles memory corrupted
event, and try to send SIGBUS to the user process which uses this
corrupted page.

For the virtualization case, QEMU catches SIGBUS and tries to inject MCE
into the guest, and the guest handles memory failure again. Thus the
guest gets the minimal effect from hardware memory corruption.

The further step I'm working on:

1, try to modify code to decrease poisoned pages in a single place
(mm/memofy-failure.c: simplify num_poisoned_pages_dec in this series).

2, try to use page_handle_poison() to handle SetPageHWPoison() and
num_poisoned_pages_inc() together. It would be best to call
num_poisoned_pages_inc() in a single place too.

3, introduce memory failure notifier list in memory-failure.c: notify
the corrupted PFN to someone who registers this list. If I can
complete [1] and [2] part, [3] will be quite easy(just call notifier
list after increasing poisoned page).

4, introduce memory recover VQ for memory balloon device, and registers
memory failure notifier list. During the guest kernel handles memory
failure, balloon device gets notified by memory failure notifier list,
and tells the host to recover the corrupted PFN(GPA) by the new VQ.

5, host side remaps the corrupted page(HVA), and tells the guest side
to unpoison the PFN(GPA). Then the guest fixes the corrupted page(GPA)
dynamically.


This patch (of 5):

clear_hwpoisoned_pages() clears HWPoison flag and decreases the number of
poisoned pages, this actually works as part of memory failure.

Move this function from sparse.c to memory-failure.c, finally there is no
CONFIG_MEMORY_FAILURE in sparse.c.

Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 014bb1de 09-May-2022 NeilBrown <neilb@suse.de>

mm: create new mm/swap.h header file

Patch series "MM changes to improve swap-over-NFS support".

Assorted improvements for swap-via-filesystem.

This is a resend of these patches, rebased on current HEAD. The only
substantial changes is that swap_dirty_folio has replaced
swap_set_page_dirty.

Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem. It
has previously worked for NFS but that broke a few releases back. This
series changes to use a new ->swap_rw rather than ->readpage and
->direct_IO. It also makes other improvements.

There is a companion series already in linux-next which fixes various
issues with NFS. Once both series land, a final patch is needed which
changes NFS over to use ->swap_rw.


This patch (of 10):

Many functions declared in include/linux/swap.h are only used within mm/

Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.

[akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b283d983 29-Apr-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hugetlb, hwpoison: separate branch for free and in-use hugepage

We know that HPageFreed pages should have page refcount 0, so
get_page_unless_zero() always fails and returns 0. So explicitly separate
the branch based on page state for minor optimization and better
readability.

Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ef526b17 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: dissolve truncated hugetlb page

If me_huge_page meets a truncated but not yet freed hugepage, it won't be
dissolved even if we hold the last refcnt. It's because the hugepage has
NULL page_mapping while it's not anonymous hugepage too. Thus we lose the
last chance to dissolve it into buddy to save healthy subpages. Remove
PageAnon check to handle these hugepages too.

Link: https://lkml.kernel.org/r/20220414114941.11223-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3f871370 29-Apr-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: minor cleanup for HWPoisonHandlable

Patch series "A few fixup and cleanup patches for memory failure", v2.

This series contains a patch to clean up the HWPoisonHandlable and another
one to dissolve truncated hugetlb page. More details can be found in the
respective changelogs.


This patch (of 2):

The local variable movable can be removed by returning true directly. Also
fix typo 'mirgate'. No functional change intended.

Link: https://lkml.kernel.org/r/20220414114941.11223-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220414114941.11223-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2ba2b008 29-Apr-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

Revert "mm/memory-failure.c: fix race with changing page compound again"

Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
page compound again") because now we fetch the page refcount under
hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
longer necessary.

Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f361e246 29-Apr-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED

In already hwpoisoned case, memory_failure() is supposed to return with
releasing the page refcount taken for error handling. But currently the
refcount is not released when called with MF_COUNT_INCREASED, which makes
page refcount inconsistent. This should be rare and non-critical, but it
might be inconvenient in testing (unpoison doesn't work).

Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f142e707 29-Apr-2022 liqiong <liqiong@nfschina.com>

mm/memory-failure.c: remove unnecessary (void*) conversions

No need cast (void*) to (struct hwp_walk*).

Link: https://lkml.kernel.org/r/20220322142826.25939-1-liqiong@nfschina.com
Signed-off-by: liqiong <liqiong@nfschina.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1825b93b 29-Apr-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()

The following VM_BUG_ON_FOLIO() is triggered when memory error event
happens on the (thp/folio) pages which are about to be freed:

[ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
[ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
[ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
[ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000
[ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
[ 1160.251815] ------------[ cut here ]------------
[ 1160.253438] kernel BUG at include/linux/mm.h:788!
[ 1160.256162] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G E 5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ #3
[ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd
[ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00
[ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292
[ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000
[ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff
[ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff
[ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000
[ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000
[ 1160.286641] FS: 00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000
[ 1160.289498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0
[ 1160.293031] Call Trace:
[ 1160.293724] <TASK>
[ 1160.294334] get_hwpoison_page+0x47d/0x570
[ 1160.295474] memory_failure+0x106/0xaa0
[ 1160.296474] ? security_capable+0x36/0x50
[ 1160.297524] hard_offline_page_store+0x43/0x80
[ 1160.298684] kernfs_fop_write_iter+0x11c/0x1b0
[ 1160.299829] new_sync_write+0xf9/0x160
[ 1160.300810] vfs_write+0x209/0x290
[ 1160.301835] ksys_write+0x4f/0xc0
[ 1160.302718] do_syscall_64+0x3b/0x90
[ 1160.303664] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1160.304981] RIP: 0033:0x7eff54b018b7

As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is
called from dump_page("hwpoison: unhandlable page") in get_any_page().
The below explains the mechanism of the race:

CPU 0 CPU 1

memory_failure
get_hwpoison_page
get_any_page
dump_page
compound = PageCompound
free_pages_prepare
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP
folio_entire_mapcount
VM_BUG_ON_FOLIO(!folio_test_large(folio))

So replace dump_page() with safer one, pr_err().

Link: https://lkml.kernel.org/r/20220427053220.719866-1-naoya.horiguchi@linux.dev
Fixes: 74e8ee4708a8 ("mm: Turn head_compound_mapcount() into folio_entire_mapcount()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b4e61fc0 29-Apr-2022 Xu Yu <xuyu@linux.alibaba.com>

Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"

Patch series "mm/memory-failure: rework fix on huge_zero_page splitting".


This patch (of 2):

This reverts commit d173d5417fb67411e623d394aab986d847e47dad.

The commit d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in
memory_failure()") explicitly skips huge_zero_page in memory_failure(), in
order to avoid triggering VM_BUG_ON_PAGE on huge_zero_page in
split_huge_page_to_list().

This works, but Yang Shi thinks that,

Raising BUG is overkilling for splitting huge_zero_page. The
huge_zero_page can't be met from normal paths other than memory
failure, but memory failure is a valid caller. So I tend to replace
the BUG to WARN + returning -EBUSY. If we don't care about the
reason code in memory failure, we don't have to touch memory
failure.

And for the issue that huge_zero_page will be set PG_has_hwpoisoned,
Yang Shi comments that,

The anonymous page fault doesn't check if the page is poisoned or
not since it typically gets a fresh allocated page and assumes the
poisoned page (isolated successfully) can't be reallocated again.
But huge zero page and base zero page are reused every time. So no
matter what fix we pick, the issue is always there.

Finally, Yang, David, Anshuman and Naoya all agree to fix the bug, i.e.,
to split huge_zero_page, in split_huge_page_to_list().

This reverts the commit d173d5417fb6 ("mm/memory-failure.c: skip
huge_zero_page in memory_failure()"), and the original bug will be fixed
by the next patch.

Link: https://lkml.kernel.org/r/872cefb182ba1dd686b0e7db1e6b2ebe5a4fff87.1651039624.git.xuyu@linux.alibaba.com
Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Suggested-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d173d541 21-Apr-2022 Xu Yu <xuyu@linux.alibaba.com>

mm/memory-failure.c: skip huge_zero_page in memory_failure()

Kernel panic when injecting memory_failure for the global
huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows.

Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:2499!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
RIP: 0010:split_huge_page_to_list+0x66a/0x880
Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
try_to_split_thp_page+0x3a/0x130
memory_failure+0x128/0x800
madvise_inject_error.cold+0x8b/0xa1
__x64_sys_madvise+0x54/0x60
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc3754f8bf9
Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

This makes huge_zero_page bail out explicitly before split in
memory_failure(), thus the panic above won't happen again.

Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 405ce051 21-Apr-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.

Think about the below race window:

CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.

get_hwpoison_page -- page is not what we want now...

The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.

A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.

Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf6445bc 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: make non-LRU movable pages unhandlable

We can not really handle non-LRU movable pages in memory failure.
Typically they are balloon, zsmalloc, etc.

Assuming we run into a base (4K) non-LRU movable page, we could reach as
far as identify_page_state(), it should not fall into any category
except me_unknown.

For the non-LRU compound movable pages, they could be taken for
transhuge pages but it's unexpected to split non-LRU movable pages using
split_huge_page_to_list in memory_failure. So we could just simply make
non-LRU movable pages unhandlable to avoid these possible nasty cases.

Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 593396b8 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages

Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
into its one caller"), invalidate_inode_page() can invalidate the pages
in the swap cache because the check of page->mapping != mapping is
removed. But invalidate_inode_page() is not expected to deal with the
pages in swap cache. Also non-lru movable page can reach here too.
They're not page cache pages. Skip these pages by checking
PageSwapCache and PageLRU.

Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 888af270 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: fix race with changing page compound again

Patch series "A few fixup patches for memory failure", v2.

This series contains a few patches to fix the race with changing page
compound page, make non-LRU movable pages unhandlable and so on. More
details can be found in the respective changelogs.

There is a race window where we got the compound_head, the hugetlb page
could be freed to buddy, or even changed to another compound page just
before we try to get hwpoison page. Think about the below race window:

CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.

get_hwpoison_page -- page is not what we want now...

If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is
introduced to record this event.

[akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a06ad3c0 22-Mar-2022 luofei <luofei@unicloud.com>

mm/hwpoison: add in-use hugepage hwpoison filter judgement

After successfully obtaining the reference count of the huge page, it is
still necessary to call hwpoison_filter() to make a filter judgement,
otherwise the filter hugepage will be unmaped and the related process
may be killed.

Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com
Signed-off-by: luofei <luofei@unicloud.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1fe111f 22-Mar-2022 luofei <luofei@unicloud.com>

mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler

When the hwpoison page meets the filter conditions, it should not be
regarded as successful memory_failure() processing for mce handler, but
should return a distinct value, otherwise mce handler regards the error
page has been identified and isolated, which may lead to calling
set_mce_nospec() to change page attribute, etc.

Here memory_failure() return -EOPNOTSUPP to indicate that the error
event is filtered, mce handler should not take any action for this
situation and hwpoison injector should treat as correct.

Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
Signed-off-by: luofei <luofei@unicloud.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b04d3eeb 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: remove unnecessary PageTransTail check

When we reach here, we're guaranteed to have non-compound page as thp is
already splited. Remove this unnecessary PageTransTail check.

Link: https://lkml.kernel.org/r/20220218090118.1105-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2ab91679 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: remove obsolete comment in __soft_offline_page

Since commit add05cecef80 ("mm: soft-offline: don't free target page in
successful page migration"), set_migratetype_isolate logic is removed.
Remove this obsolete comment.

Link: https://lkml.kernel.org/r/20220218090118.1105-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 357670f7 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()

Only for hugetlb pages in shared mappings, try_to_unmap should take
semaphore in write mode here. Rework the code to make it clear.

Link: https://lkml.kernel.org/r/20220218090118.1105-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 67ff51c6 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev

Since commit 03e5ac2fc3bf ("mm: fix crash when using XFS on loopback"),
page_mapping() can handle the Slab pages. So remove this unnecessary
PageSlab check and obsolete comment.

Link: https://lkml.kernel.org/r/20220218090118.1105-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75ee64b3 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: fix race with changing page more robustly

We're only intended to deal with the non-Compound page after we split
thp in memory_failure. However, the page could have changed compound
pages due to race window. If this happens, we could retry once to
hopefully handle the page next round. Also remove unneeded orig_head.
It's always equal to the hpage. So we can use hpage directly and remove
this redundant one.

Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49775047 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: rework the signaling logic in kill_proc

BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and the
target is current. Rework the code to make this clear.

Link: https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a994402b 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: catch unexpected -EFAULT from vma_address()

It's unexpected to walk the page table when vma_address() return
-EFAULT. But dev_pagemap_mapping_shift() is called only when vma
associated to the error page is found already in
collect_procs_{file,anon}, so vma_address() should not return -EFAULT
except with some bug, as Naoya pointed out. We can use VM_BUG_ON_VMA()
to catch this bug here.

Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 577553f4 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap

Patch series "A few cleanup and fixup patches for memory failure", v3.

This series contains a few patches to simplify the code logic, remove
unneeded variable and remove obsolete comment. Also we fix race
changing page more robustly in memory_failure. More details can be
found in the respective changelogs.

This patch (of 8):

The flags always has MF_ACTION_REQUIRED and MF_MUST_KILL set. So we do
not need to check these flags again.

Link: https://lkml.kernel.org/r/20220218090118.1105-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220218090118.1105-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 046545a6 22-Mar-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: fix error page recovered but reported "not recovered"

When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a
UCNA signature, and the core reporting and SRAR signature machine check
when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine check
processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent. But returns the
wrong error code.

Console log looks like this:

mce: Uncorrected hardware memory error in user-access at 3710b3400
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
Memory failure: 0x3710b3: already hardware poisoned
Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
mce: Memory error not recovered

kill_accessing_process() is supposed to return -EHWPOISON to notify that
SIGBUS is already set to the process and kill_me_maybe() doesn't have to
send it again. But current code simply fails to do this, so fix it to
make sure to work as intended. This change avoids the noise message
"Memory error not recovered" and skips duplicate SIGBUSs.

[tony.luck@intel.com: reword some parts of commit message]

Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Youquan Song <youquan.song@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae483c20 22-Mar-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/memory-failure.c: remove obsolete comment

With the introduction of mf_mutex, most of memory error handling process
is mutually exclusive, so the in-line comment about subtlety about
double-checking PageHWPoison is no more correct. So remove it.

Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9595d769 01-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()

Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls
to compound_head(). If any callers were passing a tail page before,
this would have failed to lock the anon VMA as page->mapping is not
valid for tail pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 869f7ee6 15-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/rmap: Convert try_to_unmap() to take a folio

Change all three callers and the worker function try_to_unmap_one().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# d6c75dc2 13-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()

Some of the callers already have the address_space and can avoid calling
folio_mapping() and checking if the folio was already truncated. Also
add kernel-doc and fix the return type (in case we ever support folios
larger than 4TB).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>


# 61e28cf0 29-Jan-2022 Joao Martins <joao.m.martins@oracle.com>

memory-failure: fetch compound_head after pgmap_pfn_valid()

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()). For devmap with compound pages fetch the
compound_head in case a tail page memory failure is being handled.

Currently this is a nop, but in the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Without this fix memory-failure handling (i.e. MCEs on pmem) with
device-dax configured namespaces will regress (and crash).

Link: https://lkml.kernel.org/r/20211202204422.26777-2-joao.m.martins@oracle.com
Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b8f0d87 14-Jan-2022 Quanfa Fu <fuqf0919@gmail.com>

mm: fix some comment errors

Link: https://lkml.kernel.org/r/20211101040208.460810-1-fuqf0919@gmail.com
Signed-off-by: Quanfa Fu <fuqf0919@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf181c58 14-Jan-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: fix unpoison_memory()

After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation. Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages. So __get_hwpoison_page() highly likely returns -EBUSY
(meaning to fail to grab page refcount) and unpoison just clears
PG_hwpoison without releasing a refcount. That does not lead to a
critical issue like kernel panic, but unpoisoned pages never get back to
buddy (leaked permanently), which is not good.

To (partially) fix this, we need to identify "taken off" pages from
other types of hwpoisoned pages. We can't use refcount or page flags
for this purpose, so a pseudo flag is defined by hacking ->private
field. Someone might think that put_page() is enough to cancel
taken-off pages, but the normal free path contains some operations not
suitable for the current purpose, and can fire VM_BUG_ON().

Note that unpoison_memory() is now supposed to be cancel hwpoison events
injected only by madvise() or
/sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
injection, so please don't try to use unpoison when testing with MCE
injection.

[lkp@intel.com: report build failure for ARCH=i386]

Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9fdc4d5 14-Jan-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE

These action_page_types are no longer used, so remove them.

Link: https://lkml.kernel.org/r/20211115084006.3728254-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91d00547 14-Jan-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: mf_mutex for soft offline and unpoison

Patch series "mm/hwpoison: fix unpoison_memory()", v4.

The main purpose of this series is to sync unpoison code to recent
changes around how hwpoison code takes page refcount. Unpoison should
work or simply fail (without crash) if impossible.

The recent works of keeping hwpoison pages in shmem pagecache introduce
a new state of hwpoisoned pages, but unpoison for such pages is not
supported yet with this series.

It seems that soft-offline and unpoison can be used as general purpose
page offline/online mechanism (not in the context of memory error). I
think that we need some additional works to realize it because currently
soft-offline and unpoison are assumed not to happen so frequently (print
out too many messages for aggressive usecases). But anyway this could
be another interesting next topic.

v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/
v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/
v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/

This patch (of 3):

Originally mf_mutex is introduced to serialize multiple MCE events, but
it is not that useful to allow unpoison to run in parallel with
memory_failure() and soft offline. So apply mf_mutex to soft offline
and unpoison. The memory failure handler and soft offline handler get
simpler with this.

Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7605426 14-Jan-2022 Yang Shi <shy828301@gmail.com>

mm: shmem: don't truncate page if memory failure happens

The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.

The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.

[akpm@linux-foundation.org: coding style fixes]
[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
[Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
and Muchun Song <songmuchun@bytedance.com> proposed and reworked the
error handling of shmem_write_begin() suggested by Linus]
Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/

Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ajay Garg <ajaygargnsit@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03b122da 26-Oct-2021 Tony Luck <tony.luck@intel.com>

x86/sgx: Hook arch_memory_failure() into mainline code

Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.

Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.

Pull the call to acquire the mutex earlier so the SGX errors are also
protected.

Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com


# 2a57d83c 24-Dec-2021 Liu Shixin <liushixin2@huawei.com>

mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()

Hulk Robot reported a panic in put_page_testzero() when testing
madvise() with MADV_SOFT_OFFLINE. The BUG() is triggered when retrying
get_any_page(). This is because we keep MF_COUNT_INCREASED flag in
second try but the refcnt is not increased.

page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:737!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 5 PID: 2135 Comm: sshd Tainted: G B 5.16.0-rc6-dirty #373
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: release_pages+0x53f/0x840
Call Trace:
free_pages_and_swap_cache+0x64/0x80
tlb_flush_mmu+0x6f/0x220
unmap_page_range+0xe6c/0x12c0
unmap_single_vma+0x90/0x170
unmap_vmas+0xc4/0x180
exit_mmap+0xde/0x3a0
mmput+0xa3/0x250
do_exit+0x564/0x1470
do_group_exit+0x3b/0x100
__do_sys_exit_group+0x13/0x20
__x64_sys_exit_group+0x16/0x20
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Modules linked in:
---[ end trace e99579b570fe0649 ]---
RIP: 0010:release_pages+0x53f/0x840

Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com
Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e37e7b0b 24-Dec-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: fix condition in free hugetlb page path

When a memory error hits a tail page of a free hugepage,
__page_handle_poison() is expected to be called to isolate the error in
4kB unit, but it's not called due to the outdated if-condition in
memory_failure_hugetlb(). This loses the chance to isolate the error in
the finer unit, so it's not optimal. Drop the condition.

This "(p != head && TestSetPageHWPoison(head)" condition is based on the
old semantics of PageHWPoison on hugepage (where PG_hwpoison flag was
set on the subpage), so it's not necessray any more. By getting to set
PG_hwpoison on head page for hugepages, concurrent error events on
different subpages in a single hugepage can be prevented by
TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb().
So dropping the condition should not reopen the race window originally
mentioned in commit b985194c8c0a ("hwpoison, hugetlb:
lock_page/unlock_page does not match for handling a free hugepage")

[naoya.horiguchi@linux.dev: fix "HardwareCorrupted" counter]
Link: https://lkml.kernel.org/r/20211220084851.GA1460264@u2004

Link: https://lkml.kernel.org/r/20211210110208.879740-1-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Fei Luo <luofei@unicloud.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [5.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0b51bfb 13-Nov-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: shmem: don't truncate page if memory failure happens"

This reverts commit b9d02f1bdd98f38e6e5ecacc9786a8f58f3f8b2c.

The error handling of that patch was fundamentally broken, and it needs
to be entirely re-done.

For example, in shmem_write_begin() it would call shmem_getpage(), then
ignore the error return from that, and look at the page pointer contents
instead.

And in shmem_read_mapping_page_gfp(), the patch tested PageHWPoison() on
a page pointer that two lines earlier had potentially been set as an
error pointer.

These issues could be individually fixed, but when it has this many
issues, I'm just reverting it instead of waiting for fixes.

Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4966455d 05-Nov-2021 Yang Shi <shy828301@gmail.com>

mm: hwpoison: handle non-anonymous THP correctly

Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
support for tmpfs and read-only file cache has been added. They could
be offlined by split THP, just like anonymous THP.

Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9d02f1b 05-Nov-2021 Yang Shi <shy828301@gmail.com>

mm: shmem: don't truncate page if memory failure happens

The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.

The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.

[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd0f230a 05-Nov-2021 Yang Shi <shy828301@gmail.com>

mm: hwpoison: refactor refcount check handling

Memory failure will report failure if the page still has extra pinned
refcount other than from hwpoison after the handler is done. Actually
the check is not necessary for all handlers, so move the check into
specific handlers. This would make the following keeping shmem page in
page cache patch easier.

There may be expected extra pin for some cases, for example, when the
page is dirty and in swapcache.

Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba9eb3ce 05-Nov-2021 Rikard Falkeborn <rikard.falkeborn@gmail.com>

mm/memory_failure: constify static mm_walk_ops

The only usage of hwp_walk_ops is to pass its address to
walk_page_range() which takes a pointer to const mm_walk_ops as
argument.

Make it const to allow the compiler to put it in read-only memory.

Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96c84dde 05-Nov-2021 Christoph Hellwig <hch@lst.de>

mm: don't include <linux/dax.h> in <linux/mempolicy.h>

Not required at all, and having this causes a huge kernel rebuild as
soon as something in dax.h changes.

Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23efd080 19-Oct-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

vsprintf: Make %pGp print the hex value

All existing users of %pGp want the hex value as well as the decoded
flag names. This looks awkward (passing the same parameter to printf
twice), so move that functionality into the core. If we want, we
can make that optional with flag arguments to %pGp in the future.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org


# bbc6b703 01-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memcg: Convert mem_cgroup_uncharge() to take a folio

Convert all the callers to call page_folio(). Most of them were already
using a head page, but a few of them I can't prove were, so this may
actually fix a bug.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# eac96c3e 28-Oct-2021 Yang Shi <shy828301@gmail.com>

mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

When handling shmem page fault the THP with corrupted subpage could be
PMD mapped if certain conditions are satisfied. But kernel is supposed
to send SIGBUS when trying to map hwpoisoned page.

There are two paths which may do PMD map: fault around and regular
fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") the thing was even worse in fault around path. The THP
could be PMD mapped as long as the VMA fits regardless what subpage is
accessed and corrupted. After this commit as long as head page is not
corrupted the THP could be PMD mapped.

In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.

So introduce a new page flag called HasHWPoisoned on the first tail
page. It indicates the THP has hwpoisoned subpage(s). It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.

The soft offline path doesn't need this since soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.
But shmem THP didn't get split then migrated at all.

Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c7cb42e9 28-Oct-2021 Yang Shi <shy828301@gmail.com>

mm: hwpoison: remove the unnecessary THP check

When handling THP hwpoison checked if the THP is in allocation or free
stage since hwpoison may mistreat it as hugetlb page. After commit
415c64c1453a ("mm/memory-failure: split thp earlier in memory error
handling") the problem has been fixed, so this check is no longer
needed. Remove it. The side effect of the removal is hwpoison may
report unsplit THP instead of unknown error for shmem THP. It seems not
like a big deal.

The following patch "mm: filemap: check if THP has hwpoisoned subpage
for PMD page fault" depends on this, which fixes shmem THP with
hwpoisoned subpage(s) are mapped PMD wrongly. So this patch needs to be
backported to -stable as well.

Link: https://lkml.kernel.org/r/20211020210755.23964-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c91c0e7 24-Sep-2021 Qi Zheng <zhengqi.arch@bytedance.com>

mm/memory_failure: fix the missing pte_unmap() call

The paired pte_unmap() call is missing before the
dev_pagemap_mapping_shift() returns. So fix it.

David says:
"I guess this code never runs on 32bit / highmem, that's why we didn't
notice so far".

[akpm@linux-foundation.org: cleanup]

Link: https://lkml.kernel.org/r/20210923122642.4999-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acfa299a 24-Sep-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()

Commit fcc00621d88b ("mm/hwpoison: retry with shake_page() for
unhandlable pages") changed the return value of __get_hwpoison_page() to
retry for transiently unhandlable cases. However, __get_hwpoison_page()
currently fails to properly judge buddy pages as handlable, so hard/soft
offline for buddy pages always fail as "unhandlable page". This is
totally regrettable.

So let's add is_free_buddy_page() in HWPoisonHandlable(), so that
__get_hwpoison_page() returns different return values between buddy
pages and unhandlable pages as intended.

Link: https://lkml.kernel.org/r/20210909004131.163221-1-naoya.horiguchi@linux.dev
Fixes: fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ac95884 02-Sep-2021 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate: enable returning precise migrate_pages() success count

Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.

Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f87060d3 02-Sep-2021 Michael Wang <yun.wang@linux.alibaba.com>

mm: fix panic caused by __page_handle_poison()

In commit 510d25c92ec4 ("mm/hwpoison: disable pcp for
page_handle_poison()"), __page_handle_poison() was introduced, and if we
mark:

RET_A = dissolve_free_huge_page();
RET_B = take_page_off_buddy();

then __page_handle_poison was supposed to return TRUE When RET_A == 0 &&
RET_B == TRUE

But since it failed to take care the case when RET_A is -EBUSY or -ENOMEM,
and just return the ret as a bool which actually become TRUE, it break the
original logic.

The following result is a huge page in freelist but was
referenced as poisoned, and lead into the final panic:

kernel BUG at mm/internal.h:95!
invalid opcode: 0000 [#1] SMP PTI
skip...
RIP: 0010:set_page_refcounted mm/internal.h:95 [inline]
RIP: 0010:remove_hugetlb_page+0x23c/0x240 mm/hugetlb.c:1371
skip...
Call Trace:
remove_pool_huge_page+0xe4/0x110 mm/hugetlb.c:1892
return_unused_surplus_pages+0x8d/0x150 mm/hugetlb.c:2272
hugetlb_acct_memory.part.91+0x524/0x690 mm/hugetlb.c:4017

This patch replaces 'bool' with 'int' to handle RET_A correctly.

Link: https://lkml.kernel.org/r/61782ac6-1e8a-4f6f-35e6-e94fce3b37f5@linux.alibaba.com
Fixes: 510d25c92ec4 ("mm/hwpoison: disable pcp for page_handle_poison()")
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
Cc: <stable@vger.kernel.org> [5.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 941ca063 02-Sep-2021 Yang Shi <shy828301@gmail.com>

mm: hwpoison: dump page for unhandlable page

Currently just very simple message is shown for unhandlable page, e.g.
non-LRU page, like: soft_offline: 0x1469f2: unknown non LRU page type
5ffff0000000000 ()

It is not very helpful for further debug, calling dump_page() could show
more useful information.

Calling dump_page() in get_any_page() in order to not duplicate the call
in a couple of different places. It may be called with pcp disabled and
holding memory hotplug lock, it should be not a big deal since hwpoison
handler is not called very often.

[shy828301@gmail.com: remove redundant pr_info per Noaya Horiguchi]
Link: https://lkml.kernel.org/r/20210824020946.195257-3-shy828301@gmail.com

Link: https://lkml.kernel.org/r/20210819054116.266126-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Mackey <tdmackey@twitter.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0505e9f 02-Sep-2021 Yang Shi <shy828301@gmail.com>

mm: hwpoison: don't drop slab caches for offlining non-LRU page

In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline. But
if the page is not slab page all the effort is wasted in vain. Even
though it is a slab page, it is not guaranteed the page could be freed
at all.

However the side effect and cost is quite high. It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches. It could make the most
workingset gone in order to just offline a page. And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.

Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.

Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:

BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
in-flight: 409271:memory_failure_work_func
pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
workqueue mm_percpu_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
pending: vmstat_update
workqueue cgroup_destroy: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
Workqueue: events memory_failure_work_func
RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
Call Trace:
__queue_work+0xd6/0x3c0
queue_work_on+0x1c/0x30
uncharge_batch+0x10e/0x110
mem_cgroup_uncharge_list+0x6d/0x80
release_pages+0x37f/0x3f0
__pagevec_release+0x1c/0x50
__invalidate_mapping_pages+0x348/0x380
inode_lru_isolate+0x10a/0x160
__list_lru_walk_one+0x7b/0x170
list_lru_walk_one+0x4a/0x60
prune_icache_sb+0x37/0x50
super_cache_scan+0x123/0x1a0
do_shrink_slab+0x10c/0x2c0
shrink_slab+0x1f1/0x290
drop_slab_node+0x4d/0x70
soft_offline_page+0x1ac/0x5b0
memory_failure_work_func+0x6a/0x90
process_one_work+0x19e/0x340
worker_thread+0x30/0x360
kthread+0x116/0x130

The lockup made the machine is quite unusable. And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.

But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:

soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.

Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: David Mackey <tdmackey@twitter.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a21c184f 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/hwpoison: fix some obsolete comments

Since commit cb731d6c62bb ("vmscan: per memory cgroup slab shrinkers"),
shrink_node_slabs is renamed to drop_slab_node. And doit argument is
changed to forcekill since commit 6751ed65dc66 ("x86/mce: Fix
siginfo_t->si_addr value for non-recoverable memory faults").

Link: https://lkml.kernel.org/r/20210814105131.48814-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed8c2f49 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/hwpoison: change argument struct page **hpagep to *hpage

It's unnecessary to pass in a struct page **hpagep because it's never
modified. Changing to use *hpage to simplify the code.

Link: https://lkml.kernel.org/r/20210814105131.48814-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea3732f7 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/hwpoison: fix potential pte_unmap_unlock pte error

If the first pte is equal to poisoned_pfn, i.e. check_hwpoisoned_entry()
return 1, the wrong ptep - 1 would be passed to pte_unmap_unlock().

Link: https://lkml.kernel.org/r/20210814105131.48814-3-linmiaohe@huawei.com
Fixes: ad9c59c24095 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae611d07 02-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/hwpoison: remove unneeded variable unmap_success

Patch series "Cleanups and fixup for hwpoison"

This series contains cleanups to remove unneeded variable, fix some
obsolete comments and so on. Also we fix potential pte_unmap_unlock on
wrong pte. More details can be found in the respective changelogs.

This patch (of 4):

unmap_success is used to indicate whether page is successfully unmapped
but it's irrelated with ZONE_DEVICE page and unmap_success is always true
here. Remove this unneeded one.

Link: https://lkml.kernel.org/r/20210814105131.48814-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210814105131.48814-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9608703e 12-Apr-2021 Jan Kara <jack@suse.cz>

mm: Fix comments mentioning i_mutex

inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# fcc00621 19-Aug-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: retry with shake_page() for unhandlable pages

HWPoisonHandlable() sometimes returns false for typical user pages due
to races with average memory events like transfers over LRU lists. This
causes failures in hwpoison handling.

There's retry code for such a case but does not work because the retry
loop reaches the retry limit too quickly before the page settles down to
handlable state. Let get_any_page() call shake_page() to fix it.

[naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached]
Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev

Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev
Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 041711ce 30-Jun-2021 Zhen Lei <thunder.leizhen@huawei.com>

mm: fix spelling mistakes

Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness

Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36af6737 30-Jun-2021 Hugh Dickins <hughd@google.com>

mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC

TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
before the page is accounted as unmapped. It is unlikely to coincide with
hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would
do well to use it.

Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1fb08ac6 30-Jun-2021 Yang Shi <shy828301@gmail.com>

mm: rmap: make try_to_unmap() void function

Currently try_to_unmap() return bool value by checking page_mapcount(),
however this may return false positive since page_mapcount() doesn't check
all subpages of compound page. The total_mapcount() could be used
instead, but its cost is higher since it traverses all subpages.

Actually the most callers of try_to_unmap() don't care about the return
value at all. So just need check if page is still mapped by page_mapped()
when necessary. And page_mapped() does bail out early when it finds
mapped subpage.

Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 510d25c9 30-Jun-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: disable pcp for page_handle_poison()

Recent changes by patch "mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists" makes kernels determine whether to use pcp by
pcp_allowed_order(), which breaks soft-offline for hugetlb pages.

Soft-offline dissolves a migration source page, then removes it from buddy
free list, so it's assumed that any subpage of the soft-offlined hugepage
are recognized as a buddy page just after returning from
dissolve_free_huge_page(). pcp_allowed_order() returns true for hugetlb,
so this assumption is no longer true.

So disable pcp during dissolve_free_huge_page() and take_page_off_buddy()
to prevent soft-offlined hugepages from linking to pcp lists.
Soft-offline should not be common events so the impact on performance
should be minimal. And I think that the optimization of Mel's patch could
benefit to hugetlb so zone_pcp_disable() is called only in hwpoison
context.

Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ed950d1 28-Jun-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: make get_hwpoison_page() call get_any_page()

__get_hwpoison_page() could fail to grab refcount by some race condition,
so it's helpful if we can handle it by retrying. We already have retry
logic, so make get_hwpoison_page() call get_any_page() when called from
memory_failure().

As a result, get_hwpoison_page() can return negative values (i.e. error
code), so some callers are also changed to handle error cases.
soft_offline_page() does nothing for -EBUSY because that's enough and
users in userspace can easily handle it. unpoison_memory() is also
unchanged because it's broken and need thorough fixes (will be done
later).

Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3f5d80e 28-Jun-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: send SIGBUS with error virutal address

Now an action required MCE in already hwpoisoned address surely sends a
SIGBUS to current process, but the SIGBUS doesn't convey error virtual
address. That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
that does pagetable walk to find the error virtual address. It could find
multiple virtual addresses for the same error page, and it seems hard to
tell which virtual address is correct one. But that's rare and sending
incorrect virtual address could be better than no address. So let's
report the first found virtual address for now.

[naoya.horiguchi@nec.com: fix walk_page_range() return]
Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp

Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea6d0630 24-Jun-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: do not lock page again when me_huge_page() successfully recovers

Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.

check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.

Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47af12ba 24-Jun-2021 Aili Yao <yaoaili@kingsoft.com>

mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in infinite loop of MCEs.

Currently memory_failure() returns 0 if it's called for already
hwpoisoned page, then the caller, kill_me_maybe(), could return without
sending SIGBUS to current process. An action required MCE is raised
when the current process accesses to the broken memory, so no SIGBUS
means that the current process continues to run and access to the error
page again soon, so running into MCE loop.

This issue can arise for example in the following scenarios:

- Two or more threads access to the poisoned page concurrently. If
local MCE is enabled, MCE handler independently handles the MCE
events. So there's a race among MCE events, and the second or latter
threads fall into the situation in question.

- If there was a precedent memory error event and memory_failure() for
the event failed to unmap the error page for some reason, the
subsequent memory access to the error page triggers the MCE loop
situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned. This allows memory error
handler to control how it sends signals to userspace. And make sure
that any process touching a hwpoisoned page should get a SIGBUS even in
"already hwpoisoned" path of memory_failure() as is done in page fault
path.

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 171936dd 24-Jun-2021 Tony Luck <tony.luck@intel.com>

mm/memory-failure: use a mutex to avoid memory_failure() races

Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.

I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1]. I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner. I
know that this is not a perfect solution, but should work for some
typical case.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/

This patch (of 2):

There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page. Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.

But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in a
mutex.

[akpm@linux-foundation.org: make mf_mutex local to memory_failure()]

Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8675d29 15-Jun-2021 yangerkun <yangerkun@huawei.com>

mm/memory-failure: make sure wait for page writeback in memory_failure

Our syzkaller trigger the "BUG_ON(!list_empty(&inode->i_wb_list))" in
clear_inode:

kernel BUG at fs/inode.c:519!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
Hardware name: linux,dummy-virt (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO)
pc : clear_inode+0x280/0x2a8
lr : clear_inode+0x280/0x2a8
Call trace:
clear_inode+0x280/0x2a8
ext4_clear_inode+0x38/0xe8
ext4_free_inode+0x130/0xc68
ext4_evict_inode+0xb20/0xcb8
evict+0x1a8/0x3c0
iput+0x344/0x460
do_unlinkat+0x260/0x410
__arm64_sys_unlinkat+0x6c/0xc0
el0_svc_common+0xdc/0x3b0
el0_svc_handler+0xf8/0x160
el0_svc+0x10/0x218
Kernel panic - not syncing: Fatal exception

A crash dump of this problem show that someone called __munlock_pagevec
to clear page LRU without lock_page: do_mmap -> mmap_region -> do_munmap
-> munlock_vma_pages_range -> __munlock_pagevec.

As a result memory_failure will call identify_page_state without
wait_on_page_writeback. And after truncate_error_page clear the mapping
of this page. end_page_writeback won't call sb_clear_inode_writeback to
clear inode->i_wb_list. That will trigger BUG_ON in clear_inode!

Fix it by checking PageWriteback too to help determine should we skip
wait_on_page_writeback.

Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
Fixes: 0bc1f8b0682c ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25182f05 15-Jun-2021 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: fix race with hugetlb page allocation

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:

CPU0: CPU1:

gather_surplus_pages()
page = alloc_surplus_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_page_unless_zero(page)
zero = put_page_testzero(page)
VM_BUG_ON_PAGE(!zero, page)
enqueue_huge_page(h, page)
put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.

So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).

Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0953a1b 06-May-2021 Ingo Molnar <mingo@kernel.org>

mm: fix typos in comments

Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d75136b 30-Apr-2021 Jane Chu <jane.chu@oracle.com>

mm/memory-failure: unnecessary amount of unmapping

It appears that unmap_mapping_range() actually takes a 'size' as its third
argument rather than a location, the current calling fashion causes
unnecessary amount of unmapping to occur.

Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com
Fixes: 6100e34b2526e ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34dc45be 25-Feb-2021 Dan Williams <dan.j.williams@intel.com>

mm: fix memory_failure() handling of dax-namespace metadata

Given 'struct dev_pagemap' spans both data pages and metadata pages be
careful to consult the altmap if present to delineate metadata. In fact
the pfn_first() helper already identifies the first valid data pfn, so
export that helper for other code paths via pgmap_pfn_valid().

Other usage of get_dev_pagemap() are not a concern because those are
operating on known data pfns having been looked up by get_user_pages().
I.e. metadata pfns are never user mapped.

Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30c9cf49 24-Feb-2021 Aili Yao <yaoaili@kingsoft.com>

mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events

When a memory uncorrected error is triggered by process who accessed the
address with error, It's Action Required Case for only current process
which triggered this; This Action Required case means Action optional to
other process who share the same page. Usually killing current process
will be sufficient, other processes sharing the same page will get be
signaled when they really touch the poisoned page.

But there is another scenario that other processes sharing the same page
want to be signaled early with PF_MCE_EARLY set. In this case, we should
get them into kill list and signal BUS_MCEERR_AO to them.

So in this patch, task_early_kill will check current process if
force_early is set, and if not current,the code will fallback to
find_early_kill_thread() to check if there is PF_MCE_EARLY process who
cares the error.

In kill_proc(), BUS_MCEERR_AR is only send to current, other processes in
kill list will be signaled with BUS_MCEERR_AO.

Link: https://lkml.kernel.org/r/20210122132424.313c8f5f.yaoaili@kingsoft.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dad4e5b3 23-Jan-2021 Dan Williams <dan.j.williams@intel.com>

mm: fix page reference leak in soft_offline_page()

The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.

Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6696d2a6 12-Jan-2021 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: fix printing of page flags

Format %pG expects a lower case 'p' in order to print the flags.
Fix it.

Link: https://lkml.kernel.org/r/20210108085202.4506-1-osalvador@suse.de
Fixes: 8295d535e2aa ("mm,hwpoison: refactor get_any_page")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f4b815a 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: return -EBUSY when migration fails

Currently, we return -EIO when we fail to migrate the page.

Migrations' failures are rather transient as they can happen due to
several reasons, e.g: high page refcount bump, mapping->migrate_page
failing etc. All meaning that at that time the page could not be
migrated, but that has nothing to do with an EIO error.

Let us return -EBUSY instead, as we do in case we failed to isolate the
page.

While are it, let us remove the "ret" print as its value does not change.

Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e8aaedb 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,memory_failure: always pin the page in madvise_inject_error

madvise_inject_error() uses get_user_pages_fast to translate the address
we specified to a page. After [1], we drop the extra reference count for
memory_failure() path. That commit says that memory_failure wanted to
keep the pin in order to take the page out of circulation.

The truth is that we need to keep the page pinned, otherwise the page
might be re-used after the put_page() and we can end up messing with
someone else's memory.

E.g:

CPU0
process X CPU1
madvise_inject_error
get_user_pages
put_page
page gets reclaimed
process Y allocates the page
memory_failure
// We mess with process Y memory

madvise() is meant to operate on a self address space, so messing with
pages that do not belong to us seems the wrong thing to do.
To avoid that, let us keep the page pinned for memory_failure as well.

Pages for DAX mappings will release this extra refcount in
memory_failure_dev_pagemap.

[1] ("23e7b5c2e271: mm, madvise_inject_error:
Let memory_failure() optionally take a page reference")

Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47e431f4 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: remove drain_all_pages from shake_page

get_hwpoison_page already drains pcplists, previously disabling them when
trying to grab a refcount. We do not need shake_page to take care of it
anymore.

Link: https://lkml.kernel.org/r/20201204102558.31607-4-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Qian Cai <qcai@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f714160 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: disable pcplists before grabbing a refcount

Currently, we have a sort of retry mechanism to make sure pages in
pcp-lists are spilled to the buddy system, so we can handle those.

We can save us this extra checks with the new disable-pcplist mechanism
that is available with [1].

zone_pcplist_disable makes sure to 1) disable pcplists, so any page that
is freed up from that point onwards will end up in the buddy system and 2)
drain pcplists, so those pages that already in pcplists are spilled to
buddy.

With that, we can make a common entry point for grabbing a refcount from
both soft_offline and memory_failure paths that is guarded by
zone_pcplist_disable/zone_pcplist_enable.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/

Link: https://lkml.kernel.org/r/20201204102558.31607-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <qcai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8295d535 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: refactor get_any_page

Patch series "HWPoison: Refactor get page interface", v2.

This patch (of 3):

When we want to grab a refcount via get_any_page, we call __get_any_page
that calls get_hwpoison_page to get the actual refcount.

get_any_page() is only there because we have a sort of retry mechanism in
case the page we met is unknown to us or if we raced with an allocation.

Also __get_any_page() prints some messages about the page type in case the
page was a free page or the page type was unknown, but if anything, we
only need to print a message in case the pagetype was unknown, as that is
reporting an error down the chain.

Let us merge get_any_page() and __get_any_page(), and let the message be
printed in soft_offline_page. While we are it, we can also remove the
'pfn' parameter as it is no longer used.

Link: https://lkml.kernel.org/r/20201204102558.31607-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201204102558.31607-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Cc: Qian Cai <qcai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8b2c2ce 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: take free pages off the buddy freelists

The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that are gatekeeper for poisoned pages. Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this the same way we did for soft_offline [2], taking the page
off the buddy freelist so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory
stress system.

Just for a matter of reference, I put a dump_page() in compaction_alloc()
to trigger for HWPoison patches:

page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
flags: 0xfffffc0800000(hwpoison)
raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: compaction_alloc

CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x6d/0x8b
compaction_alloc+0xb2/0xc0
migrate_pages+0x2a6/0x12a0
compact_zone+0x5eb/0x11c0
proactive_compact_node+0x89/0xf0
kcompactd+0x2d0/0x3a0
kthread+0x118/0x130
ret_from_fork+0x22/0x30

After that, if e.g: a process faults in the page, it will get killed
unexpectedly.
Fix it by containing the page immediatelly.

Besides that, two more changes can be noticed:

* MF_DELAYED no longer suits as we are fixing the issue by containing
the page immediately, so it does no longer rely on the allocation-time
checks to stop HWPoison to be handed over.
gain unless it is unpoisoned, so we fixed the situation.
Because of that, let us use MF_RECOVERED from now on.

* The second block that handles PageBuddy pages is no longer needed:
We call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.
Currently, get_hwpoison_page already calls drain_all_pages, and we call
get_hwpoison_page right before coming here, so we should be on the safe
side.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/cover/11792607/

[osalvador@suse.de: take the poisoned subpage off the buddy frelists]
Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de

Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17e395b6 14-Dec-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page

Patch series "HWpoison: further fixes and cleanups", v5.

This patchset includes some more fixes and a cleanup.

Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy
freelist, since having them there has proved to be bad (see [1] and
pathch#2's commit log). Patch#3 does the same for hugetlb pages.

[1] https://lkml.org/lkml/2020/9/22/565

This patch (of 4):

A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we do not handle pcppages neither from hard-offline nor from
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retry
the check again. It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.

Link: https://lkml.kernel.org/r/20201013144447.6706-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201013144447.6706-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 013339df 14-Dec-2020 Shakeel Butt <shakeelb@google.com>

mm/rmap: always do TTU_IGNORE_ACCESS

Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check. More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page. So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg
reclaim. From memcg reclaim the page_referenced() only account the
accesses from the processes which are in the same memcg of the target page
but the unmapping code is considering accesses from all the processes, so,
decreasing the effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 336bf30e 13-Nov-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: fix anon huge page migration race

Qian Cai reported the following BUG in [1]

LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.

This patch does the following:

- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.

- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:

- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).

- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.

Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54608759 17-Oct-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/memory-failure: remove a wrapper for alloc_migration_target()

There is a well-defined standard migration target callback. Use it
directly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b94e0282 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: try to narrow window race for free pages

Aristeu Rozanski reported that a customer test case started to report
-EBUSY after the hwpoison rework patchset.

There is a race window between spotting a free page and taking it off its
buddy freelist, so it might be that by the time we try to take it off, the
page has been already allocated.

This patch tries to handle such race window by trying to handle the new
type of page again if the page was allocated under us.

Reported-by: Aristeu Rozanski <aris@ruivo.org>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Aristeu Rozanski <aris@ruivo.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f2481dd 15-Oct-2020 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: double-check page count in __get_any_page()

Soft offlining could fail with EIO due to the race condition with hugepage
migration. This issuse became visible due to the change by previous patch
that makes soft offline handler take page refcount by its own. We have no
way to directly pin zero refcount page, and the page considered as a zero
refcount page could be allocated just after the first check.

This patch adds the second check to find the race and gives us chance to
handle it more reliably.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d1fd5dc 15-Oct-2020 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

memory_failure() is supposed to call action_result() when it handles a
memory error event, but there's one missing case. So let's add it.

I find that include/ras/ras_event.h has some other MF_MSG_* undefined, so
this patch also adds them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a2ffca3 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: return 0 if the page is already poisoned in soft-offline

Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.

1) madvise:

madvise_inject_error skips any poisoned page and continues
the loop.
If that was the only page to madvise, it returns 0.

2) /sys/devices/system/memory/:

When calling soft_offline_page_store()->soft_offline_page(),
we return -EBUSY in case the page is already poisoned.
This is inconsistent with a) the above example and b)
memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
let soft_offline_page return 0 if it finds the page already poisoned.

Please, note that this represents a user-api change, since now the return
error when calling soft_offline_page_store()->soft_offline_page() will be
different.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b9a217e 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page

Merging soft_offline_huge_page and __soft_offline_page let us get rid of
quite some duplicated code, and makes the code much easier to follow.

Now, __soft_offline_page will handle both normal and hugetlb pages.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-11-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79f5f8fa 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: rework soft offline for in-use pages

This patch changes the way we set and handle in-use poisoned pages. Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safe net and
would skip that page.

This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be in a buddy freelist.

Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)

* Normal pages (order-0 and anon-THP)

- If they are clean and unmapped page cache pages, we invalidate
then by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, do not call put_page directly from those paths. Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison page in
free_pages_prepare, that just skips any poisoned page, so those pages
do not end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

- wait until the page hits buddy's freelists
- take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has been already allocated so we cannot soft offline it.
But the user could always retry it.

* Hugetlb pages

- We isolate-and-migrate them

After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided to not make the code overly complicated and just fail
normally if the page we allocated in the meantime.

We can always build on top of this.

As a bonus, because of the way we handle now in-use pages, we no longer
need the put-as-isolation-migratetype dance, that was guarding for poisoned
pages to end up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06be6ff3 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: rework soft offline for free pages

When trying to soft-offline a free page, we need to first take it off the
buddy allocator. Once we know is out of reach, we can safely flag it as
poisoned.

take_page_off_buddy will be used to take a page meant to be poisoned off
the buddy allocator. take_page_off_buddy calls break_down_buddy_pages,
which splits a higher-order page in case our page belongs to one.

Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 694bf0b0 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: unify THP handling for hard and soft offline

Place the THP's page handling in a helper and use it from both hard and
soft-offline machinery, so we get rid of some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-8-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd6e2402 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: kill put_hwpoison_page

After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp
refcounting"), put_hwpoison_page got reduced to a put_page. Let us just
use put_page instead.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-7-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e27f22c 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: unexport get_hwpoison_page and make it static

Since get_hwpoison_page is only used in memory-failure code now, let us
un-export it and make it private to that code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-5-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b473bec 15-Oct-2020 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm, hwpoison: remove recalculating hpage

hpage is never used after try_to_split_thp_page() in memory_failure(), so
we don't have to update hpage. So let's not recalculate/use hpage.

Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-3-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d9d46ac 15-Oct-2020 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm,hwpoison: cleanup unused PageHuge() check

Patch series "HWPOISON: soft offline rework", v7.

This patchset fixes a couple of issues that the patchset Naoya sent [1]
contained due to rebasing problems and a misunterdansting.

Main focus of this series is to stabilize soft offline. Historically soft
offlined pages have suffered from racy conditions because PageHWPoison is
used to a little too aggressively, which (directly or indirectly) invades
other mm code which cares little about hwpoison. This results in
unexpected behavior or kernel panic, which is very far from soft offline's
"do not disturb userspace or other kernel component" policy. An example
of this can be found here [2].

Along with several cleanups, this code refactors and changes the way soft
offline work. Main point of this change set is to contain target page
"via buddy allocator" or in migrating path. For ther former we first free
the target page as we do for normal pages, and once it has reached buddy
and it has been taken off the freelists, we flag it as HWpoison. For the
latter we never get to release the page in unmap_and_move, so the page is
under our control and we can handle it in hwpoison code.

[1] https://patchwork.kernel.org/cover/11704083/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

This patch (of 14):

Drop the PageHuge check, which is dead code since memory_failure() forks
into memory_failure_hugetlb() for hugetlb pages.

memory_failure() and memory_failure_hugetlb() shares some functions like
hwpoison_user_mappings() and identify_page_state(), so they should
properly handle 4kB page, thp, and hugetlb.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Qian Cai <cai@lca.pw>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Oscar Salvador <osalvador@suse.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20200922135650.1634-2-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c312597 13-Oct-2020 Alex Shi <alex.shi@linux.alibaba.com>

mm/memory-failure.c: remove unused macro `writeback'

Unlike others we don't use the marco writeback. so let's remove it to
tame gcc warning:

mm/memory-failure.c:827: warning: macro "writeback" is not used
[-Wunused-macros]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lkml.kernel.org/r/1599715096-20369-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c43bc03d 13-Oct-2020 Xianting Tian <tian.xianting@h3c.com>

mm/memory-failure: do pgoff calculation before for_each_process()

There is no need to calculate pgoff in each loop of for_each_process(), so
move it to the place before for_each_process(), which can save some CPU
cycles.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: http://lkml.kernel.org/r/20200818082647.34322-1-tian.xianting@h3c.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f56753ac 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag

Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities. Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 19fc7bed 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: introduce a standard migration target allocation function

There are some similar functions for migration target allocation. Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants. This patch implements base migration target
allocation function. In the following patches, variants will be converted
to use this function.

Changes should be mechanical, but, unfortunately, there are some
differences. First, some callers' nodemask is assgined to NULL since NULL
nodemask will be considered as all available nodes, that is,
&node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
user provided gfp_mask has it. This is because future caller of this
function requires to set this node constaint. Lastly, if provided nodeid
is NUMA_NO_NODE, nodeid is set up to the node where migration source
lives. It helps to remove simple wrappers for setting up the nodeid.

Note that PageHighmem() call in previous function is changed to open-code
"is_highmem_idx()" since it provides more readability.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03151c6e 11-Jun-2020 Naoya Horiguchi <nao.horiguchi@gmail.com>

mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread

Action Required memory error should happen only when a processor is
about to access to a corrupted memory, so it's synchronous and only
affects current process/thread.

Recently commit 872e9a205c84 ("mm, memory_failure: don't send
BUS_MCEERR_AO for action required error") fixed the issue that Action
Required memory could unnecessarily send SIGBUS to the processes which
share the error memory. But we still have another issue that we could
send SIGBUS to a wrong thread.

This is because collect_procs() and task_early_kill() fails to add the
current process to "to-kill" list. So this patch is suggesting to fix
it. With this fix, SIGBUS(BUS_MCEERR_AR) is never sent to non-current
process/thread.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-3-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e018b45 11-Jun-2020 Naoya Horiguchi <nao.horiguchi@gmail.com>

mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill

Patch series "hwpoison: fixes signaling on memory error"

This is a small patchset to solve issues in memory error handler to send
SIGBUS to proper process/thread as expected in configuration. Please
see descriptions in individual patches for more details.

This patch (of 2):

Early-kill policy is controlled from two types of settings, one is
per-process setting prctl(PR_MCE_KILL) and the other is system-wide
setting vm.memory_failure_early_kill. Users expect per-process setting
to override system-wide setting as many other settings do, but
early-kill setting doesn't work as such.

For example, if a system configures vm.memory_failure_early_kill to 1
(enabled), a process receives SIGBUS even if it's configured to
explicitly disable PF_MCE_KILL by prctl(). That's not desirable for
applications with their own policies.

This patch is suggesting to change the priority of these two types of
settings, by checking sysctl_memory_failure_early_kill only when a given
process has the default kill policy.

Note that this patch is solving a thread choice issue too.

Originally, collect_procs() always chooses the main thread when
vm.memory_failure_early_kill is 1, even if the process has a dedicated
thread for memory error handling. SIGBUS should be sent to the
dedicated thread if early-kill is enabled via
vm.memory_failure_early_kill as we are doing for PR_MCE_KILL_EARLY
processes.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-1-git-send-email-naoya.horiguchi@nec.com
Link: http://lkml.kernel.org/r/1591321039-22141-2-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 872e9a20 01-Jun-2020 Wetp Zhang <wetp.zy@linux.alibaba.com>

mm, memory_failure: don't send BUS_MCEERR_AO for action required error

Some processes dont't want to be killed early, but in "Action Required"
case, those also may be killed by BUS_MCEERR_AO when sharing memory with
other which is accessing the fail memory. And sending SIGBUS with
BUS_MCEERR_AO for action required error is strange, so ignore the
non-current processes here.

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06202231 01-May-2020 James Morse <james.morse@arm.com>

mm/memory-failure: Add memory_failure_queue_kick()

The GHES code calls memory_failure_queue() from IRQ context to schedule
work on the current CPU so that memory_failure() can sleep.

For synchronous memory errors the arch code needs to know any signals
that memory_failure() will trigger are pending before it returns to
user-space, possibly when exiting from the IRQ.

Add a helper to kick the memory failure queue, to ensure the scheduled
work has happened. This has to be called from process context, so may
have been migrated from the original cpu. Pass the cpu the work was
queued on.

Change memory_failure_work_func() to permit being called on the 'wrong'
cpu.

Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9de4f22a 06-Apr-2020 Huang Ying <ying.huang@intel.com>

mm: code cleanup for MADV_FREE

Some comments for MADV_FREE is revised and added to help people understand
the MADV_FREE code, especially the page flag, PG_swapbacked. This makes
page_is_file_cache() isn't consistent with its comments. So the function
is renamed to page_is_file_lru() to make them consistent again. All these
are put in one patch as one logical change.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0d0381a 01-Apr-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.

mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75068518 30-Nov-2019 Yunfeng Ye <yeyunfeng@huawei.com>

mm/memory-failure.c: use page_shift() in add_to_kill()

page_shift() is supported after the commit 94ad9338109f ("mm: introduce
page_shift()").

So replace with page_shift() in add_to_kill() for readability.

Link: http://lkml.kernel.org/r/543d8bc9-f2e7-3023-7c35-2e7ed67c0e82@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# feec24a6 30-Nov-2019 Naoya Horiguchi <nao.horiguchi@gmail.com>

mm, soft-offline: convert parameter to pfn

Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.

Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 996ff7a0 30-Nov-2019 Jane Chu <jane.chu@oracle.com>

mm/memory-failure.c clean up around tk pre-allocation

add_to_kill() expects the first 'tk' to be pre-allocated, it makes
subsequent allocations on need basis, this makes the code a bit
difficult to read.

Move all the allocation internal to add_to_kill() and drop the **tk
argument.

Link: http://lkml.kernel.org/r/1565112345-28754-2-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96c804a6 18-Oct-2019 David Hildenbrand <david@redhat.com>

mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()

We should check for pfn_to_online_page() to not access uninitialized
memmaps. Reshuffle the code so we don't have to duplicate the error
message.

Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d7fed4a 14-Oct-2019 Jane Chu <jane.chu@oracle.com>

mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once

Mmap /dev/dax more than once, then read the poison location using
address from one of the mappings. The other mappings due to not having
the page mapped in will cause SIGKILLs delivered to the process.
SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb6600000
mmaped address 0x7f3cf3600000
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
=> to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
=> to deliver SIGBUS

Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 135e5351 11-Jul-2019 Jane Chu <jane.chu@oracle.com>

mm/memory-failure.c: clarify error message

Some user who install SIGBUS handler that does longjmp out therefore
keeping the process alive is confused by the error message

"[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"

Slightly modify the error message to improve clarity.

Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25b2995a 13-Jun-2019 Christoph Hellwig <hch@lst.de>

mm: remove MEMORY_DEVICE_PUBLIC support

The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# faf53def 28-Jun-2019 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled. That was caused by the
suboptimal code in current soft-offline code. See the following part:

ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
...
} else {
/*
* We set PG_hwpoison only when the migration source hugepage
* was successfully dissolved, because otherwise hwpoisoned
* hugepage remains on free hugepage list, then userspace will
* find it as SIGBUS by allocation failure. That's not expected
* in soft-offlining.
*/
ret = dissolve_free_huge_page(page);
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
}
}
return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
current code gives up offlining too early now.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.

This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.

[n-horiguchi@ah.jp.nec.com: v3]
Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b38e5962 28-Jun-2019 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but current code do not work like
that. It might lead us to misjudge the test result when
set_hwpoison_free_buddy_page() fails.

Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.

Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1439f94c 29-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 263

Based on 1 normalized pattern(s):

this software may be redistributed and or modified under the terms
of the gnu general public license gpl version 2 only as published by
the free software foundation

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f8eac901 05-Feb-2019 Eric W. Biederman <ebiederm@xmission.com>

signal: Remove task parameter from force_sig_mceerr

All of the callers pass current into force_sig_mceer so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 46612b75 05-Mar-2019 zhongjiang <zhongjiang@huawei.com>

mm: hwpoison: fix thp split handing in soft_offline_in_use_page()

When soft_offline_in_use_page() runs on a thp tail page after pmd is
split, we trigger the following VM_BUG_ON_PAGE():

Memory failure: 0x3755ff: non anonymous thp
__get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x2fffff80000000()
raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:519!

soft_offline_in_use_page() passed refcount and page lock from tail page
to head page, which is not needed because we can pass any subpage to
split_huge_page().

Naoya had fixed a similar issue in c3901e722b29 ("mm: hwpoison: fix thp
split handling in memory_failure()"). But he missed fixing soft
offline.

Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
Fixes: 61f5d698cc97 ("mm: re-enable THP")
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6376360e 01-Feb-2019 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: use do_send_sig_info() instead of force_sig()

Currently memory_failure() is racy against process's exiting, which
results in kernel crash by null pointer dereference.

The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes. As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not a right thing to do. OOM solves this issue by using
do_send_sig_info() as done in commit d2d393099de2 ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch is suggesting to do the same for hwpoison. do_send_sig_info()
properly accesses to siglock with lock_task_sighand(), so is free from
the reported race.

I confirmed that the reported bug reproduces with inserting some delay
in kill_procs(), and it never reproduces with this patch.

Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen on it because
force_sig_mceerr() is called only for synchronous processes (i.e.
BUS_MCEERR_AR happens only when some process accesses to the corrupted
memory.)

Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddeaab32 08-Jan-2019 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"

This reverts b43a9990055958e70347c56f90ea2ae32c67334c

The reverted commit caused issues with migration and poisoning of anon
huge pages. The LTP move_pages12 test will cause an "unable to handle
kernel NULL pointer" BUG would occur with stack similar to:

RIP: 0010:down_write+0x1b/0x40
Call Trace:
migrate_pages+0x81f/0xb90
__ia32_compat_sys_migrate_pages+0x190/0x190
do_move_pages_to_node.isra.53.part.54+0x2a/0x50
kernel_move_pages+0x566/0x7b0
__x64_sys_move_pages+0x24/0x30
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The purpose of the reverted patch was to fix some long existing races
with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
idea that this could also be used to address truncate/page fault races
with another patch. Further analysis has determined that i_mmap_rwsem
can not be used to address all these hugetlbfs synchronization issues.
Therefore, revert this patch while working an another approach to the
underlying issues.

Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b43a9990 28-Dec-2018 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.

- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.

[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27359fd6 30-Nov-2018 Matthew Wilcox <willy@infradead.org>

dax: Fix unlock mismatch with updated API

Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.

In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.

Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:

WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0

Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# d4ae9916 23-Aug-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: close the race against page allocation

A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline. This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."

Two main changes of this patch are:

- setting migrate type of the target page to MIGRATE_ISOLATE. As done
in free_unref_page_commit(), this makes kernel bypass pcplist when
freeing the page. So we can assume that the page is in freelist just
after put_page() returns,

- setting PG_hwpoison on free page under zone->lock which protects
freelists, so this allows us to avoid setting PG_hwpoison on a page
that is decided to be allocated soon.

[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bc9b564 23-Aug-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: fix race on soft-offlining free huge pages

Patch series "mm: soft-offline: fix race against page allocation".

Xishi recently reported the issue about race on reusing the target pages
of soft offlining. Discussion and analysis showed that we need make
sure that setting PG_hwpoison should be done in the right place under
zone->lock for soft offline. 1/2 handles free hugepage's case, and 2/2
hanldes free buddy page's case.

This patch (of 2):

There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.

The process killing is caused by the following flow:

CPU 0 CPU 1 CPU 2

soft offline
get_any_page
// find the hugetlb is free
mmap a hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
// succeed
soft_offline_free_page
// set hwpoison flag
mmap the hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
find_lock_page
return VM_FAULT_HWPOISON
mm_fault_error
do_sigbus
// kill the process

The hugetlb allocation failure comes from the following flow:

CPU 0 CPU 1

mmap a hugetlb file
// reserve all free page but don't fault-in
soft offline
get_any_page
// find the hugetlb is free
soft_offline_free_page
// set hwpoison flag
dissolve_free_huge_page
// fail because all free hugepages are reserved
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
...
dequeue_huge_page_node_exact
// ignore hwpoisoned hugepage
// and finally fail due to no-mem

The root cause of this is that current soft-offline code is written based
on an assumption that PageHWPoison flag should be set at first to avoid
accessing the corrupted data. This makes sense for memory_failure() or
hard offline, but does not for soft offline because soft offline is about
corrected (not uncorrected) error and is safe from data lost. This patch
changes soft offline semantics where it sets PageHWPoison flag only after
containment of the error page completes successfully.

Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c4c3b99 21-Aug-2018 Jiang Biao <jiang.biao2@zte.com.cn>

mm: fix page_freeze_refs and page_unfreeze_refs in comments

page_freeze_refs/page_unfreeze_refs have already been relplaced by
page_ref_freeze/page_ref_unfreeze , but they are not modified in the
comments.

Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6100e34b 13-Jul-2018 Dan Williams <dan.j.williams@intel.com>

mm, memory_failure: Teach memory_failure() about dev_pagemap pages

mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internal remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>


# ae1139ec 13-Jul-2018 Dan Williams <dan.j.williams@intel.com>

mm, memory_failure: Collect mapping size in collect_procs()

In preparation for supporting memory_failure() for dax mappings, teach
collect_procs() to also determine the mapping size. Unlike typical
mappings the dax mapping size is determined by walking page-table
entries rather than using the compound-page accounting for THP pages.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>


# 86a66810 13-Jul-2018 Dan Williams <dan.j.williams@intel.com>

mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages

Given that dax / device-mapped pages are never subject to page
allocations remove them from consideration by the soft-offline
mechanism.

Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>


# 666feb21 10-Apr-2018 Michal Hocko <mhocko@suse.com>

mm, migrate: remove reason argument from new_page_t

No allocation callback is using this argument anymore. new_page_node
used to use this parameter to convey node_id resp. migration error up
to move_pages code (do_move_page_to_node_array). The error status never
made it into the final status field and we have a better way to
communicate node id to the status field now. All other allocation
callbacks simply ignored the argument so we can drop it finally.

[mhocko@suse.com: fix migration callback]
Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
[akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
[mhocko@kernel.org: fix build]
Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31286a84 05-Apr-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: disable memory error handling on 1GB hugepage

Recently the following BUG was reported:

Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
Memory failure: 0x3c0000: recovery action for huge page: Recovered
BUG: unable to handle kernel paging request at ffff8dfcc0003000
IP: gup_pgd_range+0x1f0/0xc20
PGD 17ae72067 P4D 17ae72067 PUD 0
Oops: 0000 [#1] SMP PTI
...
CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage. This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.

I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it. We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.

[n-horiguchi@ah.jp.nec.com: v2]
Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd0e786d 25-Jan-2018 Tony Luck <tony.luck@intel.com>

x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages

In the following commit:

ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

... we added code to memory_failure() to unmap the page from the
kernel 1:1 virtual address space to avoid speculative access to the
page logging additional errors.

But memory_failure() may not always succeed in taking the page offline,
especially if the page belongs to the kernel. This can happen if
there are too many corrected errors on a page and either mcelog(8)
or drivers/ras/cec.c asks to take a page offline.

Since we remove the 1:1 mapping early in memory_failure(), we can
end up with the page unmapped, but still in use. On the next access
the kernel crashes :-(

There are also various debug paths that call memory_failure() to simulate
occurrence of an error. Since there is no actual error in memory, we
don't need to map out the page for those cases.

Revert most of the previous attempt and keep the solution local to
arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

1) there is a real error
2) memory_failure() succeeds.

All of this only applies to 64-bit systems. 32-bit kernel doesn't map
all of memory into kernel space. It isn't worth adding the code to unmap
the piece that is mapped because nobody would run a 32-bit kernel on a
machine that has recoverable machine checks.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert (Persistent Memory) <elliott@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org #v4.14
Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c0f45555 02-Aug-2017 Eric W. Biederman <ebiederm@xmission.com>

signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr

Delegate filling out struct siginfo to functions in kernel/signal.c
to simplify the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 83b57531 09-Jul-2017 Eric W. Biederman <ebiederm@xmission.com>

mm/memory_failure: Remove unused trapno from memory_failure

Today 4 architectures set ARCH_SUPPORTS_MEMORY_FAILURE (arm64, parisc,
powerpc, and x86), while 4 other architectures set __ARCH_SI_TRAPNO
(alpha, metag, sparc, and tile). These two sets of architectures do
not interesect so remove the trapno paramater to remove confusion.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# b6b18aa8 15-Nov-2017 Laszlo Toth <laszlth@gmail.com>

mm, soft_offline: improve hugepage soft offlining error log

On a failed attempt, we get the following entry: soft offline: 0x3c0000:
migration failed 1, type 17ffffc0008008 (uptodate|head)

Make this more specific to be straightforward and to follow other error
log formats in soft_offline_huge_page().

Link: http://lkml.kernel.org/r/20171016171757.GA3018@ubuntu-desk-vm
Signed-off-by: Laszlo Toth <laszlth@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce0fa3e5 16-Aug-2017 Tony Luck <tony.luck@intel.com>

x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

Speculative processor accesses may reference any memory that has a
valid page table entry. While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid. We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Also see:

http://marc.info/?l=linux-mm&m=149860136413338&w=2

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ef77ba5c 10-Jul-2017 Michal Hocko <mhocko@suse.com>

mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration

new_page is yet another duplication of the migration callback which has
to handle hugetlb migration specially. We can safely use the generic
new_page_nodemask for the same purpose.

Please note that gigantic hugetlb pages do not need any special handling
because alloc_huge_page_nodemask will make sure to check pages in all
per node pools. The reason this was done previously was that
alloc_huge_page_node treated NO_NUMA_NODE and a specific node
differently and so alloc_huge_page_node(nid) would check on this
specific node.

Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0348d2eb 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: introduce idenfity_page_state

Factoring duplicate code into a function.

Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddd40d8a 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hugetlb: delete dequeue_hwpoisoned_huge_page()

dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78bb9203 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error

Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
keep the error hugepage away from the system, which is OK but not good
enough because the hugepage still has a refcount and unpoison doesn't
work on the error hugepage (PageHWPoison flags are cleared but pages are
still leaked.) And there's "wasting health subpages" issue too. This
patch reworks on me_huge_page() to solve these issues.

For hugetlb file, recently we have truncating code so let's use it in
hugetlbfs specific ->error_remove_page().

For anonymous hugepage, it's helpful to dissolve the error page after
freeing it into free hugepage list. Migration entry and PageHWPoison in
the head page prevent the access to it.

TODO: dissolve_free_huge_page() can fail but we don't considered it yet.
It's not critical (and at least no worse that now) because in such case
the error hugepage just stays in free hugepage list without being
dissolved. By virtue of PageHWPoison in head page, it's never allocated
to processes.

[akpm@linux-foundation.org: fix unused var warnings]
Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 761ad8d7 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: introduce memory_failure_hugetlb()

memory_failure() is a big function and hard to maintain. Handling
hugetlb- and non-hugetlb- case in a single function is not good, so this
patch separates PageHuge() branch into a new function, which saves many
PageHuge() check.

Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d4a3a60b 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: dissolve free hugepage if soft-offlined

Now we have code to rescue most of healthy pages from a hwpoisoned
hugepage. So let's apply it to soft_offline_free_page too.

Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3114a84 10-Jul-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm: hugetlb: soft-offline: dissolve source hugepage after successful migration

Currently hugepage migrated by soft-offline (i.e. due to correctable
memory errors) is contained as a hugepage, which means many non-error
pages in it are unreusable, i.e. wasted.

This patch solves this issue by dissolving source hugepages into buddy.
As done in previous patch, PageHWPoison is set only on a head page of
the error hugepage. Then in dissoliving we move the PageHWPoison flag
to the raw error page so that all healthy subpages return back to buddy.

[arnd@arndb.de: fix warnings: replace some macros with inline functions]
Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b37ff71c 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: change PageHWPoison behavior on hugetlb pages

We'd like to narrow down the error region in memory error on hugetlb
pages. However, currently we set PageHWPoison flags on all subpages in
the error hugepage and add # of subpages to num_hwpoison_pages, which
doesn't fit our purpose.

So this patch changes the behavior and we only set PageHWPoison on the
head page then increase num_hwpoison_pages only by 1. This is a
preparation for narrow-down part which comes in later patches.

Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 243abd5b 10-Jul-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hugetlb: prevent reuse of hwpoisoned free hugepages

Patch series "mm: hwpoison: fixlet for hugetlb migration".

This patchset updates the hwpoison/hugetlb code to address 2 reported
issues.

One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First
half was already fixed in mainline, and another half about hugetlb cases
are solved in this series.

Another issue is "narrow-down error affected region into a single 4kB
page instead of a whole hugetlb page" issue, which was tried by Anshuman
(http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
and I updated it to apply it more widely.

This patch (of 9):

We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
as we did before. So current dequeue_huge_page_node() doesn't work as
intended because it still uses is_migrate_isolate_page() for this check.
This patch fixes it with PageHWPoison flag.

Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94310cbc 06-Jul-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level

Though migrating gigantic HugeTLB pages does not sound much like real
world use case, they can be affected by memory errors. Hence migration
at the PGD level HugeTLB pages should be supported just to enable soft
and hard offline use cases.

While allocating the new gigantic HugeTLB page, it should not matter
whether new page comes from the same node or not. There would be very
few gigantic pages on the system afterall, we should not be bothered
about node locality when trying to save a big page from crashing.

This change renames dequeu_huge_page_node() function as dequeue_huge
_page_node_exact() preserving it's original functionality. Now the new
dequeue_huge_page_node() function scans through all available online nodes
to allocate a huge page for the NUMA_NO_NODE case and just falls back
calling dequeu_huge_page_node_exact() for all other cases.

[arnd@arndb.de: make hstate_is_gigantic() inline]
Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af21bfaf 06-Jul-2017 Jeff Layton <jlayton@kernel.org>

mm: fix mapping_set_error call in me_pagecache_dirty

The error code should be negative. Since this ends up in the default case
anyway, this is harmless, but it's less confusing to negate it. Also,
later patches will require a negative error code here.

Link: http://lkml.kernel.org/r/20170525103355.6760-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7258ae5c 16-Jun-2017 James Morse <james.morse@arm.com>

mm/memory-failure.c: use compound_head() flags for huge pages

memory_failure() chooses a recovery action function based on the page
flags. For huge pages it uses the tail page flags which don't have
anything interesting set, resulting in:

> Memory failure: 0x9be3b4: Unknown page state
> Memory failure: 0x9be3b4: recovery action for unknown page: Failed

Instead, save a copy of the head page's flags if this is a huge page,
this means if there are no relevant flags for this tail page, we use the
head pages flags instead. This results in the me_huge_page() recovery
action being called:

> Memory failure: 0x9b7969: recovery action for huge page: Delayed

For hugepages that have not yet been allocated, this allows the hugepage
to be dequeued.

Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30809f55 02-Jun-2017 Punit Agrawal <punitagrawal@gmail.com>

mm/migrate: fix refcount handling when !hugepage_migration_supported()

On failing to migrate a page, soft_offline_huge_page() performs the
necessary update to the hugepage ref-count.

But when !hugepage_migration_supported() , unmap_and_move_hugepage()
also decrements the page ref-count for the hugepage. The combined
behaviour leaves the ref-count in an inconsistent state.

This leads to soft lockups when running the overcommitted hugepage test
from mce-tests suite.

Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
INFO: rcu_preempt detected stalls on CPUs/tasks:
Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
(detected by 7, t=5254 jiffies, g=963, c=962, q=321)
thugetlb_overco R running task 0 2715 2685 0x00000008
Call trace:
dump_backtrace+0x0/0x268
show_stack+0x24/0x30
sched_show_task+0x134/0x180
rcu_print_detail_task_stall_rnp+0x54/0x7c
rcu_check_callbacks+0xa74/0xb08
update_process_times+0x34/0x60
tick_sched_handle.isra.7+0x38/0x70
tick_sched_timer+0x4c/0x98
__hrtimer_run_queues+0xc0/0x300
hrtimer_interrupt+0xac/0x228
arch_timer_handler_phys+0x3c/0x50
handle_percpu_devid_irq+0x8c/0x290
generic_handle_irq+0x34/0x50
__handle_domain_irq+0x68/0xc0
gic_handle_irq+0x5c/0xb0

Address this by changing the putback_active_hugepage() in
soft_offline_huge_page() to putback_movable_pages().

This only triggers on systems that enable memory failure handling
(ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
(!ARCH_ENABLE_HUGEPAGE_MIGRATION).

I imagine this wasn't triggered as there aren't many systems running
this configuration.

[akpm@linux-foundation.org: remove dead comment, per Naoya]
Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 18365225 12-May-2017 Michal Hocko <mhocko@suse.com>

hwpoison, memcg: forcibly uncharge LRU pages

Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In
his particular case he has hit a bad_page("page still charged to
cgroup") when onlining a hwpoison page. While this looks like something
that shouldn't happen in the first place because onlining hwpages and
returning them to the page allocator makes only little sense it shows a
real problem.

hwpoison pages do not get freed usually so we do not uncharge them (at
least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
take a css reference for each charged page")) as well and so the
mem_cgroup and the associated state will never go away. Fix this leak
by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
We also have to tweak uncharge_list because it cannot rely on zero ref
count for these pages.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 286c469a 03-May-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page

Memory error handler calls try_to_unmap() for error pages in various
states. If the error page is a mlocked page, error handling could fail
with "still referenced by 1 users" message. This is because the page is
linked to and stays in lru cache after the following call chain.

try_to_unmap_one
page_remove_rmap
clear_page_mlock
putback_lru_page
lru_cache_add

memory_failure() calls shake_page() to hanlde the similar issue, but
current code doesn't cover because shake_page() is called only before
try_to_unmap(). So this patches adds shake_page().

Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bcb74de 03-May-2017 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: call shake_page() unconditionally

shake_page() is called before going into core error handling code in
order to ensure that the error page is flushed from lru_cache lists
where pages stay during transferring among LRU lists.

But currently it's not fully functional because when the page is linked
to lru_cache by calling activate_page(), its PageLRU flag is set and
shake_page() is skipped. The result is to fail error handling with
"still referenced by 1 users" message.

When the page is linked to lru_cache by isolate_lru_page(), its PageLRU
is clear, so that's fine.

This patch makes shake_page() unconditionally called to avoild the
failure.

Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82a2481e 03-May-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm/memory-failure.c: add page flag description in error paths

It helps to provide page flag description along with the raw value in
error paths during soft offline process. From sample experiments

Before the patch:

soft offline: 0x6100: migration failed 1, type 3ffff800008018
soft offline: 0x7400: migration failed 1, type 3ffff800008018

After the patch:

soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)

Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 666e5a40 03-May-2017 Minchan Kim <minchan@kernel.org>

mm: make ttu's return boolean

try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
boolean return. This patch changes it.

Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a128ca71 03-May-2017 Shaohua Li <shli@fb.com>

mm: delete unnecessary TTU_* flags

Patch series "mm: fix some MADV_FREE issues", v5.

We are trying to use MADV_FREE in jemalloc. Several issues are found.
Without solving the issues, jemalloc can't use the MADV_FREE feature.

- Doesn't support system without swap enabled. Because if swap is off,
we can't or can't efficiently age anonymous pages. And since
MADV_FREE pages are mixed with other anonymous pages, we can't
reclaim MADV_FREE pages. In current implementation, MADV_FREE will
fallback to MADV_DONTNEED without swap enabled. But in our
environment, a lot of machines don't enable swap. This will prevent
our setup using MADV_FREE.

- Increases memory pressure. page reclaim bias file pages reclaim
against anonymous pages. This doesn't make sense for MADV_FREE pages,
because those pages could be freed easily and refilled with very
slight penality. Even page reclaim doesn't bias file pages, there is
still an issue, because MADV_FREE pages and other anonymous pages are
mixed together. To reclaim a MADV_FREE page, we probably must scan a
lot of other anonymous pages, which is inefficient. In our test, we
usually see oom with MADV_FREE enabled and nothing without it.

- Accounting. There are two accounting problems. We don't have a global
accounting. If the system is abnormal, we don't know if it's a
problem from MADV_FREE side. The other problem is RSS accounting.
MADV_FREE pages are accounted as normal anon pages and reclaimed
lazily, so application's RSS becomes bigger. This confuses our
workloads. We have monitoring daemon running and if it finds
applications' RSS becomes abnormal, the daemon will kill the
applications even kernel can reclaim the memory easily.

To address the first the two issues, we can either put MADV_FREE pages
into a separate LRU list (Minchan's previous patches and V1 patches), or
put them into LRU_INACTIVE_FILE list (suggested by Johannes). The
patchset use the second idea. The reason is LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used once file pages. So we can
still efficiently reclaim MADV_FREE pages there without interference
with other anon and active file pages. Putting the pages into inactive
file list also has an advantage which allows page reclaim to prioritize
MADV_FREE pages and used once file pages. MADV_FREE pages are put into
the lru list and clear SwapBacked flag, so PageAnon(page) &&
!PageSwapBacked(page) will indicate a MADV_FREE pages. These pages will
directly freed without pageout if they are clean, otherwise normal swap
will reclaim them.

For the third issue, the previous post adds global accounting and a
separate RSS count for MADV_FREE pages. The problem is we never get
accurate accounting for MADV_FREE pages. The pages are mapped to
userspace, can be dirtied without notice from kernel side. To get
accurate accounting, we could write protect the page, but then there is
extra page fault overhead, which people don't want to pay. Jemalloc
guys have concerns about the inaccurate accounting, so this post drops
the accounting patches temporarily. The info exported to
/proc/pid/smaps for MADV_FREE pages are kept, which is the only place we
can get accurate accounting right now.

This patch (of 6):

Johannes pointed out TTU_LZFREE is unnecessary. It's true because we
always have the flag set if we want to do an unmap. For cases we don't
do an unmap, the TTU_LZFREE part of code should never run.

Also the TTU_UNMAP is unnecessary. If no other flags set (for example,
TTU_MIGRATION), an unmap is implied.

The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
code

Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29930025 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h>

We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 85fbe5d1 24-Feb-2017 Yisheng Xie <xieyisheng1@huawei.com>

HWPOISON: soft offlining for non-lru movable page

Extend soft offlining framework to support non-lru page, which already
support migration after commit bda807d44454 ("mm: migrate: support
non-lru movable page migration")

When memory corrected errors occur on a non-lru movable page, we can
choose to stop using it by migrating data onto another page and disable
the original (maybe half-broken) one.

Link: http://lkml.kernel.org/r/1485867981-16037-4-git-send-email-ysxie@foxmail.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6326fec1 24-Dec-2016 Nicholas Piggin <npiggin@gmail.com>

mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked

A page is not added to the swap cache without being swap backed,
so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3901e72 10-Nov-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: fix thp split handling in memory_failure()

When memory_failure() runs on a thp tail page after pmd is split, we
trigger the following VM_BUG_ON_PAGE():

page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1
flags: 0x1fffc000400000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(!page_count(p))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/memory-failure.c:1132!

memory_failure() passed refcount and page lock from tail page to head
page, which is not needed because we can pass any subpage to
split_huge_page().

Fixes: 61f5d698cc97 ("mm: re-enable THP")
Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c7fd825 28-Jul-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: remove incorrect comments

dequeue_hwpoisoned_huge_page() can be called without page lock hold, so
let's remove incorrect comment.

The reason why the page lock is not really needed is that
dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
are just allocated but not linked to active list yet, even without
taking page lock.

Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 599d0c95 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: move LRU lists to node

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 495367c0 20-May-2016 Chen Yucong <slaoub@gmail.com>

mm/memory-failure.c: replace "MCE" with "Memory failure"

HWPoison was specific to some particular x86 platforms. And it is often
seen as high level machine check handler. And therefore, 'MCE' is used
for the format prefix of printk(). However, 'PowerNV' has also used
HWPoison for handling memory errors[1], so 'MCE' is no longer suitable
to memory_failure.c.

Additionally, 'MCE' and 'Memory failure' have different context. The
former belongs to exception context and the latter belongs to process
context. Furthermore, HWPoison can also be used for off-lining those
sub-health pages that do not trigger any machine check exception.

This patch aims to replace 'MCE' with a more appropriate prefix.

[1] commit 75eb3d9b60c2 ("powerpc/powernv: Get FSP memory errors
and plumb into memory poison infrastructure.")

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2e7e00b 28-Apr-2016 Konstantin Khlebnikov <koct9i@gmail.com>

mm/memory-failure: fix race with compound page split/merge

get_hwpoison_page() must recheck relation between head and tail pages.

n-horiguchi said: without this recheck, the race causes kernel to pin an
irrelevant page, and finally makes kernel crash for refcount mismatch.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1170532b 17-Mar-2016 Joe Perches <joe@perches.com>

mm: convert printk(KERN_<LEVEL> to pr_<level>

Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

- Realign arguments
- Add missing newline to format
- kmemleak-test.c has a "kmemleak: " prefix added to the
"Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org> [percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b94f175 15-Mar-2016 Wang Xiaoqiang <wangxq10@lzu.edu.cn>

mm/memory-failure.c: remove useless "undef"s

Remove the useless #undef, since the corresponding #define has already
been removed.

Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98fd1ef4 15-Jan-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: exit with failure for non anonymous thp

Currently memory_failure() doesn't handle non anonymous thp case,
because we can hardly expect the error handling to be successful, and it
can just hit some corner case which results in BUG_ON or something
severe like that. This is also the case for soft offline code, so let's
make it in the same way.

Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
but it's unnecessary because get_any_page() is already called when
running on this code, which takes a refcount of the target page
regardress of the flag. So this patch also removes it.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acc14dc4 15-Jan-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: clean up soft_offline_page()

soft_offline_page() has some deeply indented code, that's the sign of
demand for cleanup. So let's do this. No functionality change.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e41a30c 15-Jan-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: adjust for new thp refcounting

Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
changes in thp refcounting rule. This patch fixes them up.

In the new refcounting, we no longer use tail->_mapcount to keep tail's
refcount, and thereby we can simplify get/put_hwpoison_page().

And another change is that tail's refcount is not transferred to the raw
page during thp split (more precisely, in new rule we don't take
refcount on tail page any more.) So when we need thp split, we have to
transfer the refcount properly to the 4kB soft-offlined page before
migration.

thp split code goes into core code only when precheck
(total_mapcount(head) == page_count(head) - 1) passes to avoid useless
split, where we assume that one refcount is held by the caller of thp
split and the others are taken via mapping. To meet this assumption,
this patch moves thp split part in soft_offline_page() after
get_any_page().

[akpm@linux-foundation.org: remove unneeded #define, per Kirill]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d96b339f 15-Jan-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: check return value in second __get_any_page() call

I saw the following BUG_ON triggered in a testcase where a process calls
madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
calls migratepages command repeatedly (doing ping-pong among different
NUMA nodes) for the first process:

Soft offlining page 0x60000 at 0x700000600000
__get_any_page: 0x60000 free buddy page
page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
flags: 0x1fffc0000000000()
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
------------[ cut here ]------------
kernel BUG at /src/linux-dev/include/linux/mm.h:342!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
RIP: 0010:[<ffffffff8118998c>] [<ffffffff8118998c>] put_page+0x5c/0x60
RSP: 0018:ffff88007c213e00 EFLAGS: 00010246
Call Trace:
put_hwpoison_page+0x4e/0x80
soft_offline_page+0x501/0x520
SyS_madvise+0x6bc/0x6f0
entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
RIP [<ffffffff8118998c>] put_page+0x5c/0x60
RSP <ffff88007c213e00>

The root cause resides in get_any_page() which retries to get a refcount
of the page to be soft-offlined. This function calls
put_hwpoison_page(), expecting that the target page is putback to LRU
list. But it can be also freed to buddy. So the second check need to
care about such case.

Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d2fa965 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp, mm: split_huge_page(): caller need to lock page

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setup and remove migration entries require
page to be locked.

Some of split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48c935ad 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

page-flags: define PG_locked behavior on compound pages

lock_page() must operate on the whole compound page. It doesn't make
much sense to lock part of compound page. Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked(). They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never
appear there. VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d798ca3 06-Nov-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: make compound_head() robust

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

CPU0 CPU1

isolate_migratepages_block()
page_count()
compound_head()
!!PageTail() == true
put_page()
tail->first_page = NULL
head = tail->first_page
alloc_pages(__GFP_COMP)
prep_compound_page()
tail->first_page = head
__SetPageTail(p);
!!PageTail() == true
<head == NULL dereferencing>

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.

The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.

hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.

That means page->compound_head shares storage space with:

- page->lru.next;
- page->next;
- page->rcu_head.next;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.

page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5f65109 05-Nov-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: ratelimit messages from unpoison_memory()

Currently kernel prints out results of every single unpoison event, which
i= s not necessary because unpoison is purely a testing feature and
testers can = get little or no information from lots of lines of unpoison
log storm. So this patch ratelimits printk in unpoison_memory().

This patch introduces a file local ratelimit_state, which adds 64 bytes to
memory-failure.o. If we apply pr_info_ratelimited() for 8 callsite below,
2= 56 bytes is added, so it's a win.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94a59fb3 09-Sep-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

hwpoison: use page_cgroup_ino for filtering by memcg

Hwpoison allows to filter pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then its ino using cgroup_ino, but now we have a helper method for
that, page_cgroup_ino, so use it instead.

This patch also loosens the hwpoison memcg filter dependency rules - it
makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
hwpoison memcg filter does not require anything (nor it used to) from
CONFIG_MEMCG_SWAP side.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96db800f 08-Sep-2015 Vlastimil Babka <vbabka@suse.cz>

mm: rename alloc_pages_exact_node() to __alloc_pages_node()

alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.

The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").

Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.

Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.

Both differences will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 230ac719 08-Sep-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/hwpoison: don't try to unpoison containment-failed pages

memory_failure() can be called at any page at any time, which means that
we can't eliminate the possibility of containment failure. In such case
the best option is to leak the page intentionally (and never touch it
later.)

We have an unpoison function for testing, and it cannot handle such
containment-failed pages, which results in kernel panic (visible with
various calltraces.) So this patch suggests that we limit the
unpoisonable pages to properly contained pages and ignore any other
ones.

Testers are recommended to keep in mind that there're un-unpoisonable
pages when writing test programs.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da1b13cc 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix race between soft_offline_page and unpoison_memory

Wanpeng Li reported a race between soft_offline_page() and
unpoison_memory(), which causes the following kernel panic:

BUG: Bad page state in process bash pfn:97000
page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
flags: 0x1fffff80080048(uptodate|active|swapbacked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x40(active)
Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
Call Trace:
dump_stack+0x48/0x5c
bad_page+0xe6/0x140
free_pages_prepare+0x2f9/0x320
? uncharge_list+0xdd/0x100
free_hot_cold_page+0x40/0x170
__put_single_page+0x20/0x30
put_page+0x25/0x40
unmap_and_move+0x1a6/0x1f0
migrate_pages+0x100/0x1d0
? kill_procs+0x100/0x100
? unlock_page+0x6f/0x90
__soft_offline_page+0x127/0x2a0
soft_offline_page+0xa6/0x200

This race is explained like below:

CPU0 CPU1

soft_offline_page
__soft_offline_page
TestSetPageHWPoison
unpoison_memory
PageHWPoison check (true)
TestClearPageHWPoison
put_page -> release refcount held by get_hwpoison_page in unpoison_memory
put_page -> release refcount held by isolate_lru_page in __soft_offline_page
migrate_pages

The second put_page() releases refcount held by isolate_lru_page() which
will lead to unmap_and_move() releases the last refcount of page and w/
mapcount still 1 since try_to_unmap() is not called if there is only one
user map the page. Anyway, the page refcount and mapcount will still
mess if the page is mapped by multiple users.

This race was introduced by commit 4491f71260 ("mm/memory-failure: set
PageHWPoison before migrate_pages()"), which focuses on preventing the
reuse of successfully migrated page. Before this commit we prevent the
reuse by changing the migratetype to MIGRATE_ISOLATE during soft
offlining, which has the following problems, so simply reverting the
commit is not a best option:

1) it doesn't eliminate the reuse completely, because
set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
target page if the pageblock of the page contains one or more
unmovable pages (i.e. has_unmovable_pages() returns true).

2) the original code changes migratetype to MIGRATE_ISOLATE
forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
regardless of the original migratetype state, which could impact
other subsystems like memory hotplug or compaction.

This patch moves PageSetHWPoison just after put_page() in
unmap_and_move(), which closes up the reported race window and minimizes
another race window b/w SetPageHWPoison and reallocation (which causes
the reuse of soft-offlined page.) The latter race window still exists
but it's acceptable, because it's rare and effectively the same as
ordinary "containment failure" case even if it happens, so keep the
window open is acceptable.

Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e30456b 08-Sep-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/hwpoison: introduce num_poisoned_pages wrappers

num_poisoned_pages counter will be changed outside mm/memory-failure.c
by a subsequent patch, so this patch prepares wrappers to manipulate it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 665d9da7 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: replace most of put_page in memory error handling by put_hwpoison_page

Replace most instances of put_page() in memory error handling with
put_hwpoison_page().

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94bf4ec8 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error handling

Introduce put_hwpoison_page to put refcount for memory error handling.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e0e635b 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix PageHWPoison test/set race

There is a race between madvise_hwpoison path and memory_failure:

CPU0 CPU1

madvise_hwpoison
get_user_pages_fast
PageHWPoison check (false)
memory_failure
TestSetPageHWPoison
soft_offline_page
PageHWPoison check (true)
return -EBUSY (without put_page)

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d1900c7 08-Sep-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix failure to split thp w/ refcount held

THP pages will get a refcount in madvise_hwpoison() w/
MF_COUNT_INCREASED flag, however, the refcount is still held when fail
to split THP pages.

Fix it by reducing the refcount of THP pages when fail to split THP.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33398cf2 08-Sep-2015 Michal Hocko <mhocko@suse.cz>

memcg: export struct mem_cgroup

mem_cgroup structure is defined in mm/memcontrol.c currently which means
that the code outside of this file has to use external API even for
trivial access stuff.

This patch exports mm_struct with its dependencies and makes some of the
exported functions inlines. This even helps to reduce the code size a bit
(make defconfig + CONFIG_MEMCG=y)

text data bss dec hex filename
12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
12354970 1823792 1089536 15268298 e8f9ca vmlinux.after

This is not much (370B) but better than nothing.

We also save a function call in some hot paths like callers of
mem_cgroup_count_vm_event which is used for accounting.

The patch doesn't introduce any functional changes.

[vdavykov@parallels.com: inline memcg_kmem_is_active]
[vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
[akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
[akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f6bf39b 14-Aug-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix panic due to split huge zero page

Bug:

------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:1957!
invalid opcode: 0000 [#1] SMP
Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
RIP: split_huge_page_to_list+0xdb/0x120
Call Trace:
memory_failure+0x32e/0x7c0
madvise_hwpoison+0x8b/0x160
SyS_madvise+0x40/0x240
? do_page_fault+0x37/0x90
entry_SYSCALL_64_fastpath+0x12/0x71
Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
RIP split_huge_page_to_list+0xdb/0x120
RSP <ffff8800db16fde8>
---[ end trace aee7ce0df8e44076 ]---

Testcase:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <errno.h>
#include <string.h>

#define MB 1024*1024

int main(void)
{
char *mem;

posix_memalign((void **)&mem, 2 * MB, 200 * MB);

madvise(mem, 200 * MB, MADV_HWPOISON);

free(mem);

return 0;
}

Huge zero page is allocated if page fault w/o FAULT_FLAG_WRITE flag.
The get_user_pages_fast() which called in madvise_hwpoison() will get
huge zero page if the page is not allocated before. Huge zero page is a
tranparent huge page, however, it is not an anonymous page.
memory_failure will split the huge zero page and trigger
BUG_ON(is_huge_zero_page(page));

After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling
for non-tail-refcounted thp"), memory_failure will not catch non anon
thp from madvise_hwpoison path and this bug occur.

Fix it by catching non anon thp in memory_failure in order to not split
huge zero page in madvise_hwpoison path.

After this patch:

Injecting memory failure for page 0x202800 at 0x7fd8ae800000
MCE: 0x202800: non anonymous thp
[...]

[akpm@linux-foundation.org: remove second split, per Wanpeng]
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03613808 14-Aug-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount held

Hugetlbfs pages will get a refcount in get_any_page() or
madvise_hwpoison() if soft offlining through madvise. The refcount which
is held by the soft offline path should be released if we fail to isolate
hugetlbfs pages.

Fix it by reducing the refcount for both isolation success and failure.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> [3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f32be67 14-Aug-2015 Wanpeng Li <wanpeng.li@hotmail.com>

mm/hwpoison: fix page refcount of unknown non LRU page

After trying to drain pages from pagevec/pageset, we try to get reference
count of the page again, however, the reference count of the page is not
reduced if the page is still not on LRU list.

Fix it by adding the put_page() to drop the page reference which is from
__get_any_page().

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> [3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4491f712 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: set PageHWPoison before migrate_pages()

Now page freeing code doesn't consider PageHWPoison as a bad page, so by
setting it before completing the page containment, we can prevent the
error page from being reused just after successful page migration.

I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
page table entry is transformed into migration entry, not to hwpoison
entry.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98ed2b00 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: give up error handling for non-tail-refcounted thp

"non anonymous thp" case is still racy with freeing thp, which causes
panic due to put_page() for refcount-0 page. It seems that closing up
this race might be hard (and/or not worth doing,) so let's give up the
error handling for this case.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a209ef09 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: fix race in counting num_poisoned_pages

When memory_failure() is called on a page which are just freed after
page migration from soft offlining, the counter num_poisoned_pages is
raised twi= ce. So let's fix it with using TestSetPageHWPoison.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a09233f3 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: unlock_page before put_page

Recently I addressed a few of hwpoison race problems and the patches are
merged on v4.2-rc1. It made progress, but unfortunately some problems
still remain due to less coverage of my testing. So I'm trying to fix
or avoid them in this series.

One point I'm expecting to discuss is that patch 4/5 changes the page
flag set to be checked on free time. In current behavior, __PG_HWPOISON
is not supposed to be set when the page is freed. I think that there is
no strong reason for this behavior, and it causes a problem hard to fix
only in error handler side (because __PG_HWPOISON could be set at
arbitrary timing.) So I suggest to change it.

With this patchset, hwpoison stress testing in official mce-test
testsuite (which previously failed) passes.

This patch (of 5):

In "just unpoisoned" path, we do put_page and then unlock_page, which is
a wrong order and causes "freeing locked page" bug. So let's fix it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97f0b134 24-Jun-2015 Xie XiuQi <xiexiuqi@huawei.com>

tracing: add trace event for memory-failure

RAS user space tools like rasdaemon which base on trace event, could
receive mce error event, but no memory recovery result event. So, I want
to add this event to make this scenario complete.

This patch add a event at ras group for memory-failure.

The output like below:
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:24
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

[xiexiuqi@huawei.com: fix build error]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc3e2af4 24-Jun-2015 Xie XiuQi <xiexiuqi@huawei.com>

memory-failure: change type of action_result's param 3 to enum

Change type of action_result's param 3 to enum for type consistency,
and rename mf_outcome to mf_result for clearly.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc637b17 24-Jun-2015 Xie XiuQi <xiexiuqi@huawei.com>

memory-failure: export page_type and action result

Export 'outcome' and 'action_page_type' to mm.h, so we could use
this emnus outside.

This patch is preparation for adding trace events for memory-failure
recovery action.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2491ffee 24-Jun-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: me_huge_page() does nothing for thp

memory_failure() is supposed not to handle thp itself, but to split it.
But if something were wrong and page_action() were called on thp,
me_huge_page() (action routine for hugepages) should be better to take
no action, rather than to take wrong action prepared for hugetlb (which
triggers BUG_ON().)

This change is for potential problems, but makes sense to me because thp
is an actively developing feature and this code path can be open in the
future.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# add05cec 24-Jun-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: don't free target page in successful page migration

Stress testing showed that soft offline events for a process iterating
"mmap-pagefault-munmap" loop can trigger
VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

Soft offlining page 0x70fe1 at 0x70100008d000
Soft offlining page 0x705fb at 0x70300008d000
page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
flags: 0x1fffff80800000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
RIP: free_pcppages_bulk+0x52a/0x6f0
Call Trace:
drain_pages_zone+0x3d/0x50
drain_local_pages+0x1d/0x30
on_each_cpu_mask+0x46/0x80
drain_all_pages+0x14b/0x1e0
soft_offline_page+0x432/0x6e0
SyS_madvise+0x73c/0x780
system_call_fastpath+0x12/0x17
Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
RIP [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
RSP <ffff88007a117d28>
---[ end trace 53926436e76d1f35 ]---

When soft offline successfully migrates page, the source page is supposed
to be freed. But there is a race condition where a source page looks
isolated (i.e. the refcount is 0 and the PageHWPoison is set) but
somewhat linked to pcplist. Then another soft offline event calls
drain_all_pages() and tries to free such hwpoisoned page, which is
forbidden.

This odd page state seems to happen due to the race between put_page() in
putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
drop lru_add_drain_all() in __soft_offline_page()", or to change page
freeing code for this soft offline's purpose.

Instead, let's think about the difference between hard offline and soft
offline. There is an interesting difference in how to isolate the in-use
page between these, that is, hard offline marks PageHWPoison of the target
page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
offline tries to free the target page then marks PageHWPoison. This
difference might be the source of complexity and result in bugs like the
above. So making soft offline isolate with keeping refcount can be a
solution for this problem.

We can pass to page migration code the "reason" which shows the caller, so
let's use this more to avoid calling putback_lru_page() when called from
soft offline, which effectively does the isolation for soft offline. With
this change, target pages of soft offline never be reused without changing
migratetype, so this patch also removes the related code.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ead07f6a 24-Jun-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling

memory_failure() can run in 2 different mode (specified by
MF_COUNT_INCREASED) in page refcount perspective. When
MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
takes a refcount of the target page. And if cleared, memory_failure()
takes it in it's own.

In current code, however, refcounting is done differently in each caller.
For example, madvise_hwpoison() uses get_user_pages_fast() and
hwpoison_inject() uses get_page_unless_zero(). So this inconsistent
refcounting causes refcount failure especially for thp tail pages.
Typical user visible effects are like memory leak or
VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

To fix this refcounting issue, this patch introduces get_hwpoison_page()
to handle thp tail pages in the same manner for each caller of hwpoison
code.

memory_failure() might fail to split thp and in such case it returns
without completing page isolation. This is not good because PageHWPoison
on the thp is still set and there's no easy way to unpoison such thps. So
this patch try to roll back any action to the thp in "non anonymous thp"
case and "thp split failed" case, expecting an MCE(SRAR) generated by
later access afterward will properly free such thps.

[akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 415c64c1 24-Jun-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: split thp earlier in memory error handling

memory_failure() doesn't handle thp itself at this time and need to split
it before doing isolation. Currently thp is split in the middle of
hwpoison_user_mappings(), but there're corner cases where memory_failure()
wrongly tries to handle thp without splitting.

1) "non anonymous" thp, which is not a normal operating mode of thp,
but a memory error could hit a thp before anon_vma is initialized. In
such case, split_huge_page() fails and me_huge_page() (intended for
hugetlb) is called for thp, which triggers BUG_ON in page_hstate().

2) !PageLRU case, where hwpoison_user_mappings() returns with
SWAP_SUCCESS and the result is the same as case 1.

memory_failure() can't avoid splitting, so let's split it more earlier,
which also reduces code which are prepared for both of normal page and
thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebb09738 24-Jun-2015 Andi Kleen <ak@linux.intel.com>

mm, hwpoison: remove obsolete "Notebook" todo list

All the items mentioned here have been either addressed, or were not
really needed. So just remove the comment.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0de78df 24-Jun-2015 Andi Kleen <ak@linux.intel.com>

mm, hwpoison: add comment describing when to add new cases

Here's another comment fix for hwpoison.

It describes the "guiding principle" on when to add new
memory error recovery code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 602498f9 05-May-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: fix num_poisoned_pages counting on concurrent events

If multiple soft offline events hit one free page/hugepage concurrently,
soft_offline_page() can handle the free page/hugepage multiple times,
which makes num_poisoned_pages counter increased more than once. This
patch fixes this wrong counting by checking TestSetPageHWPoison for normal
papes and by checking the return value of dequeue_hwpoisoned_huge_page()
for hugepages.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Dean Nelson <dnelson@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org> [3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09789e5d 05-May-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure: call shake_page() when error hits thp tail page

Currently memory_failure() calls shake_page() to sweep pages out from
pcplists only when the victim page is 4kB LRU page or thp head page.
But we should do this for a thp tail page too.

Consider that a memory error hits a thp tail page whose head page is on
a pcplist when memory_failure() runs. Then, the current kernel skips
shake_pages() part, so hwpoison_user_mappings() returns without calling
split_huge_page() nor try_to_unmap() because PageLRU of the thp head is
still cleared due to the skip of shake_page().

As a result, me_huge_page() runs for the thp, which is broken behavior.

One effect is a leak of the thp. And another is to fail to isolate the
memory error, so later access to the error address causes another MCE,
which kills the processes which used the thp.

This patch fixes this problem by calling shake_page() for thp tail case.

Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Dean Nelson <dnelson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bcc54222 15-Apr-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hugetlb: introduce page_huge_active

We are not safe from calling isolate_huge_page() on a hugepage
concurrently, which can make the victim hugepage in invalid state and
results in BUG_ON().

The root problem of this is that we don't have any information on struct
page (so easily accessible) about hugepages' activeness. Note that
hugepages' activeness means just being linked to
hstate->hugepage_activelist, which is not the same as normal pages'
activeness represented by PageActive flag.

Normal pages are isolated by isolate_lru_page() which prechecks PageLRU
before isolation, so let's do similarly for hugetlb with a new
paeg_huge_active().

set/clear_page_huge_active() should be called within hugetlb_lock. But
hugetlb_cow() and hugetlb_no_page() don't do this, being justified because
in these functions set_page_huge_active() is called right after the
hugepage is allocated and no other thread tries to isolate it.

[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
[fengguang.wu@intel.com: set_page_huge_active() can be static]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 64d37a2b 15-Apr-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: define page types for action_result() in one place

This cleanup patch moves all strings passed to action_result() into a
singl= e array action_page_type so that a reader can easily find which
kind of actio= n results are possible. And this patch also fixes the
odd lines to be printed out, like "unknown page state page" or "free
buddy, 2nd try page".

[akpm@linux-foundation.org: rename messages, per David]
[akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU', per Andi]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Xie XiuQi" <xiexiuqi@huawei.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ab3b598 12-Feb-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()

A race condition starts to be visible in recent mmotm, where a PG_hwpoison
flag is set on a migration source page *before* it's back in buddy page
poo= l.

This is problematic because no page flag is supposed to be set when
freeing (see __free_one_page().) So the user-visible effect of this race
is that it could trigger the BUG_ON() when soft-offlining is called.

The root cause is that we call lru_add_drain_all() to make sure that the
page is in buddy, but that doesn't work because this function just
schedule= s a work item and doesn't wait its completion.
drain_all_pages() does drainin= g directly, so simply dropping
lru_add_drain_all() solves this problem.

Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb731d6c 12-Feb-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

vmscan: per memory cgroup slab shrinkers

This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
set, it will be called per memory cgroup. The memory cgroup to scan
objects from is passed in shrink_control->memcg. If the memory cgroup
is NULL, a memcg aware shrinker is supposed to scan objects from the
global list. Unaware shrinkers are only called on global pressure with
memcg=NULL.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b4f7799 12-Dec-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: vmscan: invoke slab shrinkers from shrink_zone()

The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.

For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.

This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.

Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.

[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d28eb9c8 12-Dec-2014 Davidlohr Bueso <dave@stgolabs.net>

mm/memory-failure: share the i_mmap_rwsem

No brainer conversion: collect_procs_file() only schedules a process for
later kill, share the lock, similarly to the anon vma variant.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83cde9e8 12-Dec-2014 Davidlohr Bueso <dave@stgolabs.net>

mm: use new helper functions around the i_mmap_mutex

Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0554329 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm, memory_hotplug/failure: drain single zone pcplists

Memory hotplug and failure mechanisms have several places where pcplists
are drained so that pages are returned to the buddy allocator and can be
e.g. prepared for offlining. This is always done in the context of a
single zone, we can reduce the pcplists drain to the single zone, which
is now possible.

The change should make memory offlining due to hotremove or failure
faster and not disturbing unrelated pcplists anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93481ff0 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm: introduce single zone pcplists drain

The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.

This patch introduces new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to the single
zone.

All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6dc52cbe 28-Jul-2014 Chen, Gong <gong.chen@linux.intel.com>

RAS, HWPOISON: Fix wrong error recovery status

When Uncorrected error happens, if the poisoned page is referenced
by more than one user after error recovery, the recovery is not
successful. But currently the display result is wrong.
Before this patch:

MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
mce: Memory error not recovered

After this patch:

MCE 0x44e336: dirty mlocked LRU page recovery: Failed
MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
mce: Memory error not recovered

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406530260-26078-3-git-send-email-gong.chen@linux.intel.com
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>


# f29374b1 19-Sep-2014 Zefan Li <lizefan@huawei.com>

cgroup: remove redundant check in cgroup_ino()

After we implemented default unified hierarchy, cgrp->kn can never
be NULL.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# f37d4298 06-Aug-2014 Andi Kleen <ak@linux.intel.com>

hwpoison: fix race with changing page during offlining

When a hwpoison page is locked it could change state due to parallel
modifications. The original compound page can be torn down and then
this 4k page becomes part of a differently-size compound page is is a
standalone regular page.

Check after the lock if the page is still the same compound page.

We could go back, grab the new head page and try again but it should be
quite rare, so I thought this was safest. A retry loop would be more
difficult to test and may have more side effects.

The hwpoison code by design only tries to handle cases that are
reasonably common in workloads, as visible in page-flags.

I'm not really that concerned about handling this (likely rare case),
just not crashing on it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52089b14 30-Jul-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hwpoison: call action_result() in failure path of hwpoison_user_mappings()

hwpoison_user_mappings() could fail for various reasons, so printk()s to
print out the reasons should be done in each failure check inside
hwpoison_user_mappings().

And currently we don't call action_result() when hwpoison_user_mappings()
fails, which is not consistent with other exit points of memory error
handler. So this patch fixes these messaging problems.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93a9eb39 30-Jul-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hwpoison: fix hugetlbfs/thp precheck in hwpoison_user_mappings()

A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
handling path of the victimized page frame that belong to non-LRU")
rejects going into unmapping operation for hugetlbfs/thp pages, which
results in failing error containing on such pages. This patch fixes it.

With this patch, hwpoison functional tests in mce-test testsuite pass.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0f7a756 23-Jul-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/rmap.c: fix pgoff calculation to handle hugepage correctly

I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
anonymous hugepage with mbind() in the kernel v3.16-rc3. This is
because pgoff's calculation in rmap_walk_anon() fails to consider
compound_order() only to have an incorrect value.

This patch introduces page_to_pgoff(), which gets the page's offset in
PAGE_CACHE_SIZE.

Kirill pointed out that page cache tree should natively handle
hugepages, and in order to make hugetlbfs fit it, page->index of
hugetlbfs page should be in PAGE_CACHE_SIZE. This is beyond this patch,
but page_to_pgoff() contains the point to be fixed in a single function.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0bc1f8b0 02-Jul-2014 Chen Yucong <slaoub@gmail.com>

hwpoison: fix the handling path of the victimized page frame that belong to non-LRU

Until now, the kernel has the same policy to handle victimized page
frames that belong to kernel-space(reserved/slab-subsystem) or
non-LRU(unknown page state). In other word, the result of handling
either of these victimized page frames is (IGNORED | FAILED), and the
return value of memory_failure() is -EBUSY.

This patch is to avoid that memory_failure() returns very soon due to
the "true" value of (!PageLRU(p)), and it also ensures that
action_result() can report more precise information("reserved kernel",
"kernel slab", and "unknown page state") instead of "non LRU",
especially for memory errors which are detected by memory-scrubbing.

Andi said:

: While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
:
: soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
: page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
: page flags: 0x3ffff800004081(locked|slab|head)
: ------------[ cut here ]------------
: kernel BUG at mm/rmap.c:1495!
:
: I think what happened is that a LRU page turned into a slab page in
: parallel with offlining. memory_failure initially tests for this case,
: but doesn't retest later after the page has been locked.
:
: ...
:
: I ran this patch in a loop over night with some stress plus
: the mcelog test suite running in a loop. I cannot guarantee it hit it,
: but it should have given it a good beating.
:
: The kernel survived with no messages, although the mcelog test suite
: got killed at some point because it couldn't fork anymore. Probably
: some unrelated problem.
:
: So the patch is ok for me for .16.

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ba08129 04-Jun-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO)

Currently memory error handler handles action optional errors in the
deferred manner by default. And if a recovery aware application wants
to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
However, such signal can be sent only to the main thread, so it's
problematic if the application wants to have a dedicated thread to
handler such signals.

So this patch adds dedicated thread support to memory error handler. We
have PF_MCE_EARLY flags for each thread separately, so with this patch
AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
thread. If you want to implement a dedicated thread, you call prctl()
to set PF_MCE_EARLY on the thread.

Memory error handler collects processes to be killed, so this patch lets
it check PF_MCE_EARLY flag on each thread in the collecting routines.

No behavioral change for all non-early kill cases.

Tony said:

: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
: that thread wasn't the main thread for the process.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Kamil Iskra <iskra@mcs.anl.gov>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74614de1 04-Jun-2014 Tony Luck <tony.luck@intel.com>

mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED

When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.

task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.

But for MF_ACTION_REQUIRED case we must not defer. The error is in the
execution path of the current thread so we must send the SIGBUS
immediatley.

Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a70ffcac 04-Jun-2014 Tony Luck <tony.luck@intel.com>

mm/memory-failure.c-failure: send right signal code to correct thread

When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:

if ((flags & MF_ACTION_REQUIRED) && t == current) {

will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).

We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page. If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6edd6cc6 04-Jun-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: move comment

The comment about pages under writeback is far from the relevant code, so
let's move it to the right place.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68711a74 04-Jun-2014 David Rientjes <rientjes@google.com>

mm, migration: add destination page freeing callback

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c8e0181 04-Jun-2014 Christoph Lameter <cl@linux.com>

mm: replace __get_cpu_var uses with this_cpu_ptr

Replace places where __get_cpu_var() is used for an address calculation
with this_cpu_ptr().

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfc8c901 04-Jun-2014 Vladimir Davydov <vdavydov.dev@gmail.com>

mem-hotplug: implement get/put_online_mems

kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.

What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.

[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]

This patch (of 2):

{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.

This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.

lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e030ecc 22-May-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: fix memory leak by race between poison and unpoison

When a memory error happens on an in-use page or (free and in-use)
hugepage, the victim page is isolated with its refcount set to one.

When you try to unpoison it later, unpoison_memory() calls put_page()
for it twice in order to bring the page back to free page pool (buddy or
free hugepage list). However, if another memory error occurs on the
page which we are unpoisoning, memory_failure() returns without
releasing the refcount which was incremented in the same call at first,
which results in memory leak and unconsistent num_poisoned_pages
statistics. This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org> [2.6.32+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b985194c 22-May-2014 Chen Yucong <slaoub@gmail.com>

hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage

For handling a free hugepage in memory failure, the race will happen if
another thread hwpoisoned this hugepage concurrently. So we need to
check PageHWPoison instead of !PageHWPoison.

If hwpoison_filter(p) returns true or a race happens, then we need to
unlock_page(hpage).

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 668f9abb 03-Mar-2014 David Rientjes <rientjes@google.com>

mm: close PageTail race

Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b1664924 11-Feb-2014 Tejun Heo <tj@kernel.org>

cgroup: introduce cgroup_ino()

mm/memory-failure.c::hwpoison_filter_task() has been reaching into
cgroup to extract the associated ino to be used as a filtering
criterion. This is an implementation detail which shouldn't be
depended upon from outside cgroup proper and is about to change with
the scheduled kernfs conversion.

This patch introduces a proper interface to determine the associated
ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
instead of reaching directly into cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>


# 8d547ff4 10-Feb-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASED

mce-test detected a test failure when injecting error to a thp tail
page. This is because we take page refcount of the tail page in
madvise_hwpoison() while the fix in commit a3e0f9e47d5e
("mm/memory-failure.c: transfer page count from head page to tail page
after split thp") assumes that we always take refcount on the head page.

When a real memory error happens we take refcount on the head page where
memory_failure() is called without MF_COUNT_INCREASED set, so it seems
to me that testing memory error on thp tail page using madvise makes
little sense.

This patch cancels moving refcount in !MF_COUNT_INCREASED for valid
testing.

[akpm@linux-foundation.org: s/&&/&/]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.9+: a3e0f9e47d5e]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54b9dd14 23-Jan-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: shift page lock from head page to tail page after thp split

After thp split in hwpoison_user_mappings(), we hold page lock on the
raw error page only between try_to_unmap, hence we are in danger of race
condition.

I found in the RHEL7 MCE-relay testing that we have "bad page" error
when a memory error happens on a thp tail page used by qemu-kvm:

Triggering MCE exception on CPU 10
mce: [Hardware Error]: Machine check events logged
MCE exception done on CPU 10
MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
MCE 0x38c535: dirty LRU page recovery: Recovered
qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
BUG: Bad page state in process qemu-kvm pfn:38c400
page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00
page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
Call Trace:
dump_stack+0x19/0x1b
bad_page.part.59+0xcf/0xe8
free_pages_prepare+0x148/0x160
free_hot_cold_page+0x31/0x140
free_hot_cold_page_list+0x46/0xa0
release_pages+0x1c1/0x200
free_pages_and_swap_cache+0xad/0xd0
tlb_flush_mmu.part.46+0x4c/0x90
tlb_finish_mmu+0x55/0x60
exit_mmap+0xcb/0x170
mmput+0x67/0xf0
vhost_dev_cleanup+0x231/0x260 [vhost_net]
vhost_net_release+0x3f/0x90 [vhost_net]
__fput+0xe9/0x270
____fput+0xe/0x10
task_work_run+0xc4/0xe0
do_exit+0x2bb/0xa40
do_group_exit+0x3f/0xa0
get_signal_to_deliver+0x1d0/0x6e0
do_signal+0x48/0x5e0
do_notify_resume+0x71/0xc0
retint_signal+0x48/0x8c

The reason of this bug is that a page fault happens before unlocking the
head page at the end of memory_failure(). This strange page fault is
trying to access to address 0x20 and I'm not sure why qemu-kvm does
this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
way we catch the bad page bug/warning because we try to free a locked
page (which was the former head page.)

To fix this, this patch suggests to shift page lock from head page to
tail page just after thp split. SIGSEGV still happens, but it affects
only error affected VMs, not a whole system.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.9+] # a3e0f9e47d5ef "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59c82b70 21-Jan-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages

Some part of putback_lru_pages() and putback_movable_pages() is
duplicated, so it could confuse us what we should use. We can remove
putback_lru_pages() since it is not really needed now. This makes us
undestand and maintain the code more easily.

And comment on putback_movable_pages() is stale now, so fix it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 549543df 21-Jan-2014 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

mm, memory-failure: fix typo in me_pagecache_dirty()

[akpm@linux-foundation.org: s/cache/pagecache/]
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3e0f9e4 02-Jan-2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: transfer page count from head page to tail page after split thp

Memory failures on thp tail pages cause kernel panic like below:

mce: [Hardware Error]: Machine check events logged
MCE exception done on CPU 7
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
PGD bae42067 PUD ba47d067 PMD 0
Oops: 0000 [#1] SMP
...
CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
...
Call Trace:
me_huge_page+0x3e/0x50
memory_failure+0x4bb/0xc20
mce_process_work+0x3e/0x70
process_one_work+0x171/0x420
worker_thread+0x11b/0x3a0
? manage_workers.isra.25+0x2b0/0x2b0
kthread+0xe4/0x100
? kthread_create_on_node+0x190/0x190
ret_from_fork+0x7c/0xb0
? kthread_create_on_node+0x190/0x190
...
RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0
CR2: 0000000000000058

The reasoning of this problem is shown below:
- when we have a memory error on a thp tail page, the memory error
handler grabs a refcount of the head page to keep the thp under us.
- Before unmapping the error page from processes, we split the thp,
where page refcounts of both of head/tail pages don't change.
- Then we call try_to_unmap() over the error page (which was a tail
page before). We didn't pin the error page to handle the memory error,
this error page is freed and removed from LRU list.
- We never have the error page on LRU list, so the first page state
check returns "unknown page," then we move to the second check
with the saved page flag.
- The saved page flag have PG_tail set, so the second page state check
returns "hugepage."
- We call me_huge_page() for freed error page, then we hit the above panic.

The root cause is that we didn't move refcount from the head page to the
tail page after split thp. So this patch suggests to do this.

This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
misjudgement of page_action() for errors on mlocked pages"). Note that we
did have the same refcount problem before this commit, but it was just
ignored because we had only first page state check which returned "unknown
page." The commit changed the refcount problem from "doesn't work" to
"kernel panic."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org> [3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a49ecbcd 18-Dec-2013 Jianguo Wu <wujianguo@huawei.com>

mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully

After a successful hugetlb page migration by soft offline, the source
page will either be freed into hugepage_freelists or buddy(over-commit
page). If page is in buddy, page_hstate(page) will be NULL. It will
hit a NULL pointer dereference in dequeue_hwpoisoned_huge_page().

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff81163761>] dequeue_hwpoisoned_huge_page+0x131/0x1d0
PGD c23762067 PUD c24be2067 PMD 0
Oops: 0000 [#1] SMP

So check PageHuge(page) after call migrate_pages() successfully.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 498d319b 14-Nov-2013 Stefani Seibold <stefani@seibold.net>

kfifo API type safety

This patch enhances the type safety for the kfifo API. It is now safe
to put const data into a non const FIFO and the API will now generate a
compiler warning when reading from the fifo where the destination
address is pointing to a const variable.

As a side effect the kfifo_put() does now expect the value of an element
instead a pointer to the element. This was suggested Russell King. It
make the handling of the kfifo_put easier since there is no need to
create a helper variable for getting the address of a pointer or to pass
integers of different sizes.

IMHO the API break is okay, since there are currently only six users of
kfifo_put().

The code is also cleaner by kicking out the "if (0)" expressions.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03b61ff3 12-Nov-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: move set_migratetype_isolate() outside get_any_page()

Chen Gong pointed out that set/unset_migratetype_isolate() was done in
different functions in mm/memory-failure.c, which makes the code less
readable/maintainable. So this patch does it in soft_offline_page().

With this patch, we get to hold lock_memory_hotplug() longer but it's
not a problem because races between memory hotplug and soft offline are
very rare.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Chen, Gong <gong.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d421acd 30-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: fix false report on 2nd attempt at page recovery

If the page is poisoned by software injection w/ MF_COUNT_INCREASED
flag, there is a false report during the 2nd attempt at page recovery
which is not truthful.

This patch fixes it by reporting the first attempt to try free buddy
page recovery if MF_COUNT_INCREASED is set.

Before patch:

[ 346.332041] Injecting memory failure at pfn 200010
[ 346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed

After patch:

[ 297.742600] Injecting memory failure at pfn 200010
[ 297.742941] MCE 0x200010: free buddy page recovery: Delayed

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e76d30e2 30-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: fix test for a transparent huge page

PageTransHuge() can't guarantee the page is a transparent huge page
since it returns true for both transparent huge and hugetlbfs pages.

This patch fixes it by checking the page is also !hugetlbfs page.

Before patch:

[ 121.571128] Injecting memory failure at pfn 23a200
[ 121.571141] MCE 0x23a200: huge page recovery: Delayed
[ 140.355100] MCE: Memory failure is now running on 0x23a200

After patch:

[ 94.290793] Injecting memory failure at pfn 23a000
[ 94.290800] MCE 0x23a000: huge page recovery: Delayed
[ 105.722303] MCE: Software-unpoisoned page 0x23a000

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ba5eebc 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/memory-failure.c: fix bug triggered by unpoisoning empty zero page

Injecting memory failure for page 0x19d0 at 0xb77d2000
MCE 0x19d0: non LRU page recovery: Ignored
MCE: Software-unpoisoned page 0x19d0
BUG: Bad page state in process bash pfn:019d0
page:f3461a00 count:0 mapcount:0 mapping: (null) index:0x0
page flags: 0x40000404(referenced|reserved)
Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
Hardware name: LENOVO 7034DD7/ , BIOS 9HKT47AUS 01//2012
00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
Call Trace:
dump_stack+0x41/0x52
bad_page+0xcf/0xeb
free_pages_prepare+0x12a/0x140
free_hot_cold_page+0x21/0x110
__put_single_page+0x21/0x30
put_page+0x25/0x40
unpoison_memory+0x107/0x200
hwpoison_unpoison+0x20/0x30
simple_attr_write+0xb6/0xd0
vfs_write+0xa0/0x1b0
SyS_write+0x4f/0x90
sysenter_do_call+0x12/0x22
Disabling lock debugging due to kernel taint

Testcase:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <errno.h>

#define PAGES_TO_TEST 1
#define PAGE_SIZE 4096

int main(void)
{
char *mem;

mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;

munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

return 0;
}

There is one page reference count for default empty zero page,
madvise_hwpoison add another one by get_user_pages_fast. memory_hwpoison
reduce one page reference count since it's a non LRU page.
unpoison_memory release the last page reference count and free empty zero
page to buddy system which is not correct since empty zero page has
PG_reserved flag. This patch fix it by don't reduce the page reference
count under 1 against empty zero page.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86e05773 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: drop forward reference declarations __soft_offline_page()

Drop forward reference declarations __soft_offline_page.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0be35096 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock

Set pageblock migration type will hold zone->lock which is heavy contended
in system to avoid race. However, soft offline page will set pageblock
migration type twice during get page if the page is in used, not hugetlbfs
page and not on lru list. There is unnecessary to set the pageblock
migration type and hold heavy contended zone->lock again if the first
round get page have already set the pageblock to right migration type.

The trick here is migration type is MIGRATE_ISOLATE. There are other two
parts can change MIGRATE_ISOLATE except hwpoison. One is memory hoplug,
however, we hold lock_memory_hotplug() which avoid race. The second is
CMA which umovable page allocation requst can't fallback to. So it's safe
here.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd9538a5 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: replace atomic_long_sub() with atomic_long_dec()

Replace atomic_long_sub() with atomic_long_dec() since the page is normal
page instead of hugetlbfs page or thp.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0cea3fdc 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: fix race against poison thp

There is a race between hwpoison page and unpoison page, memory_failure
set the page hwpoison and increase num_poisoned_pages without hold page
lock, and one page count will be accounted against thp for
num_poisoned_pages. However, unpoison can occur before memory_failure
hold page lock and split transparent hugepage, unpoison will decrease
num_poisoned_pages by 1 << compound_order since memory_failure has not yet
split transparent hugepage with page lock held. That means we account one
page for hwpoison and 1 << compound_order for unpoison. This patch fix it
by inserting a PageTransHuge check before doing TestClearPageHWPoison,
unpoison failed without clearing PageHWPoison and decreasing
num_poisoned_pages.

A B
memory_failue
TestSetPageHWPoison(p);
if (PageHuge(p))
nr_pages = 1 << compound_order(hpage);
else
nr_pages = 1;
atomic_long_add(nr_pages, &num_poisoned_pages);
unpoison_memory
nr_pages = 1<< compound_trans_order(page);
if(TestClearPageHWPoison(p))
atomic_long_sub(nr_pages, &num_poisoned_pages);
lock page
if (!PageHWPoison(p))
unlock page and return
hwpoison_user_mappings
if (PageTransHuge(hpage))
split_huge_page(hpage);

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f9121153 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: don't need to hold compound lock for hugetlbfs page

compound lock is introduced by commit e9da73d67("thp: compound_lock."), it
is used to serialize put_page against __split_huge_page_refcount(). In
addition, transparent hugepages will be splitted in hwpoison handler and
just one subpage will be poisoned. There is unnecessary to hold compound
lock for hugetlbfs page. This patch replace compound_trans_order by
compond_order in the place where the page is hugetlbfs page.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 841fcc58 11-Sep-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages

memory_failure() store the page flag of the error page before doing unmap,
and (only) if the first check with page flags at the time decided the
error page is unknown, it do the second check with the stored page flag
since memory_failure() does unmapping of the error pages before doing
page_action(). This unmapping changes the page state, especially
page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so
page_action() can't catch mlocked pages after that.

However, memory_failure() can't handle memory errors on dirty mlocked
pages correctly. try_to_unmap_one will move the dirty bit from pte to the
physical page, the second check lose it since it check the stored page
flag. This patch fix it by restore PG_dirty flag to stored page flag if
the page is dirty.

Testcase:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <errno.h>

#define PAGES_TO_TEST 2
#define PAGE_SIZE 4096

int main(void)
{
char *mem;
int i;

mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0);

for (i = 0; i < PAGES_TO_TEST; i++)
mem[i * PAGE_SIZE] = 'a';

if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;

return 0;
}

Before patch:

[ 912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000
[ 912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered
[ 912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users
[ 912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000
[ 912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered
[ 912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users

After patch:

[ 163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000
[ 163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered
[ 163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users
[ 163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000
[ 163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered
[ 163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d6fdbdb 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page()

Soft offline code expects that MIGRATE_ISOLATE is set on the target page
only during soft offlining work. But currenly it doesn't work as expected
when get_any_page() fails and returns negative value. In the result, end
users can have unexpectedly isolated pages. This patch just fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8ec1cee 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: use migrate_pages() instead of migrate_huge_page()

Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
as an argument, instead of taking a pointer to the list of hugepages to be
migrated. This behavior was introduced in commit 189ebff28 ("hugetlb:
simplify migrate_huge_page()"), and was OK because until now hugepage
migration is enabled only for soft-offlining which migrates only one
hugepage in a single call.

But the situation will change in the later patches in this series which
enable other users of page migration to support hugepage migration. They
can kick migration for both of normal pages and hugepages in a single
call, so we need to go back to original implementation which uses linked
lists to collect the hugepages to be migrated.

With this patch, soft_offline_huge_page() switches to use migrate_pages(),
and migrate_huge_page() is not used any more. So let's remove it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ce3d744 27-Aug-2013 Dave Chinner <dchinner@redhat.com>

shrinker: add node awareness

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for node
aware shrinkers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8e33a52f 25-Jul-2013 Joe Perches <joe@perches.com>

treewide: Fix printks with 0x%#

Using 0x%# emits 0x0x. Only one is necessary.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# cf870c70 10-Jul-2013 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

mce: acpi/apei: Soft-offline a page on firmware GHES notification

If the firmware indicates in GHES error data entry that the error threshold
has exceeded for a corrected error event, then we try to soft-offline the
page. This could be called in interrupt context, so we queue this up similar
to how we handle memory failure scenarios.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>


# f15bdfa8 03-Jul-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: fix memory leak in successful soft offlining

After a successful page migration by soft offlining, the source page is
not properly freed and it's never reusable even if we unpoison it
afterward.

This is caused by the race between freeing page and setting PG_hwpoison.
In successful soft offlining, the source page is put (and the refcount
becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked
to pagevec and actual freeing back to buddy is delayed. So if
PG_hwpoison is set for the page before freeing, the freeing does not
functions as expected (in such case freeing aborts in
free_pages_prepare() check.)

This patch tries to make sure to free the source page before setting
PG_hwpoison on it. To avoid reallocating, the page keeps
MIGRATE_ISOLATE until after setting PG_hwpoison.

This patch also removes obsolete comments about "keeping elevated
refcount" because what they say is not true. Unlike memory_failure(),
soft_offline_page() uses no special page isolation code, and the
soft-offlined pages have no elevated.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e3986295 29-Apr-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON: check dirty flag to match against clean page

Currently page_action() does not check dirty flag to determine whether
the error page is "clean mlocked/unevictable LRU" page. This doesn't
cause any misjudgement because we do matching against "dirty
mlocked/unevictable LRU" just before the check. But in order to make
code consistent and/or to avoid potential regression, we had better
check dirty flag explicitly.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Chen Gong <gong.chen@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f4b9fc5 22-Feb-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON: change order of error_states[]'s elements

error_states[] has two separate states "unevictable LRU page" and
"mlocked LRU page", and the former one has the higher priority now. But
because of that the latter one is rarely chosen because pages with
PageMlocked highly likely have PG_unevictable set. On the other hand,
PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
shared memory, so reversing the priority of these two states helps us
clearly distinguish them.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 524fca1e 22-Feb-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON: fix misjudgement of page_action() for errors on mlocked pages

memory_failure() can't handle memory errors on mlocked pages correctly,
because page_action() judges such errors as ones on "unknown pages"
instead of ones on "unevictable LRU page" or "mlocked LRU page". In
order to determine page_state page_action() checks page flags at the
timing of the judgement, but such page flags are not the same with those
just after memory_failure() is called, because memory_failure() does
unmapping of the error pages before doing page_action(). This unmapping
changes the page state, especially page_remove_rmap() (called from
try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
mlocked pages after that.

With this patch, we store the page flag of the error page before doing
unmap, and (only) if the first check with page flags at the time decided
the error page is unknown, we do the second check with the stored page
flag. This implementation doesn't change error handling for the page
types for which the first check can determine the page state correctly.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c620e2b 22-Feb-2013 Hugh Dickins <hughd@google.com>

mm: remove offlining arg to migrate_pages

No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4db0e950 22-Feb-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp

num_poisoned_pages counts up the number of pages isolated by memory
errors. But for thp, only one subpage is isolated because memory error
handler splits it, so it's wrong to add (1 << compound_trans_order).

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af8fae7c 22-Feb-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm/memory-failure.c: clean up soft_offline_page()

Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements. All of this mess come from
get_any_page().

This function should only get page refcount as the name implies, but it
does some page isolating actions like SetPageHWPoison() and dequeuing
hugepage. This patch corrects it and introduces some internal
subroutines to make soft offlining code more readable and maintainable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 293c07e3 22-Feb-2013 Xishi Qiu <qiuxishi@huawei.com>

memory-failure: use num_poisoned_pages instead of mce_bad_pages

Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa8dd8a9 22-Feb-2013 Xishi Qiu <qiuxishi@huawei.com>

memory-failure: do code refactor of soft_offline_page()

There are too many return points randomly intermingled with some "goto
done" return points. So adjust the function structure, one for the
success path, the other for the failure path. Use atomic_long_inc
instead of atomic_long_add.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ebff32c 22-Feb-2013 Xishi Qiu <qiuxishi@huawei.com>

memory-failure: fix an error of mce_bad_pages statistics

When doing

$ echo paddr > /sys/devices/system/memory/soft_offline_page

to offline a *free* page, the value of mce_bad_pages will be added, and
the page is set HWPoison flag, but it is still managed by page buddy
alocator.

$ cat /proc/meminfo | grep HardwareCorrupted

shows the value.

If we offline the same page, the value of mce_bad_pages will be added
*again*, this means the value is incorrect now. Assume the page is
still free during this short time.

soft_offline_page()
get_any_page()
"else if (is_free_buddy_page(p))" branch return 0
"goto done";
"atomic_long_add(1, &mce_bad_pages);"

This patch:

Move poisoned page check at the beginning of the function in order to
fix the error.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff604cf6 11-Dec-2012 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: hwpoison: fix action_result() to print out dirty/clean

action_result() fails to print out "dirty" even if an error occurred on
a dirty pagecache, because when we check PageDirty in action_result() it
was cleared after page isolation even if it's dirty before error
handling. This can break some applications that monitor this message,
so should be fixed.

There are several callers of action_result() except page_action(), but
either of them are not for LRU pages but for free pages or kernel pages,
so we don't have to consider dirty or not for them.

Note that PG_dirty can be set outside page locks as described in commit
6746aff74da2 ("HWPOISON: shmem: call set_page_dirty() with locked
page"), so this patch does not completely closes the race window, but
just narrows it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b023f468 11-Dec-2012 Wen Congyang <wency@cn.fujitsu.com>

memory-hotplug: skip HWPoisoned page when offlining pages

hwpoisoned may be set when we offline a page by the sysfs interface
/sys/devices/system/memory/soft_offline_page or
/sys/devices/system/memory/hard_offline_page. If we don't clear
this flag when onlining pages, this page can't be freed, and will
not in free list. So we can't offline these pages again. So we
should skip such page when offlining pages.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4fc3f1d6 02-Dec-2012 Ingo Molnar <mingo@kernel.org>

mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable

rmap_walk_anon() and try_to_unmap_anon() appears to be too
careful about locking the anon vma: while it needs protection
against anon vma list modifications, it does not need exclusive
access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:

96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# 7b2a2d4a 19-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: migrate: Add a tracepoint for migrate_pages

The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>


# 783657a7 29-Nov-2012 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft offline: split thp at the beginning of soft_offline_page()

When we try to soft-offline a thp tail page, put_page() is called on the
tail page unthinkingly and VM_BUG_ON is triggered in put_compound_page().

This patch splits thp before going into the main body of soft-offlining.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf181b9f 08-Oct-2012 Michel Lespinasse <walken@google.com>

mm anon rmap: replace same_anon_vma linked list with an interval tree.

When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.

By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.

While the change is large, all of its pieces are fairly simple.

Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.

When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b2dbba8 08-Oct-2012 Michel Lespinasse <walken@google.com>

mm: replace vma prio_tree with an interval tree

Implement an interval tree as a replacement for the VMA prio_tree. The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA. So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.

Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c255a458 31-Jul-2012 Andrew Morton <akpm@linux-foundation.org>

memcg: rename config variables

Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 189ebff2 31-Jul-2012 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

hugetlb: simplify migrate_huge_page()

Since we migrate only one hugepage, don't use linked list for passing the
page around. Directly pass the page that need to be migrated as argument.
This also removes the usage of page->lru in the migrate path.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc32f634 30-Jul-2012 Joonsoo Kim <js1304@gmail.com>

mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()

Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for
use by compaction") changed the declaration of migrate_pages() and
migrate_huge_pages().

But it missed changing the argument of migrate_huge_pages() in
soft_offline_huge_page(). In this case, we should call
migrate_huge_pages() with MIGRATE_SYNC.

Additionally, there is a mismatch between type the of argument and the
function declaration for migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6751ed65 11-Jul-2012 Tony Luck <tony.luck@intel.com>

x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults

In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.

Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 3.4+


# 71dd0b8a 29-May-2012 Borislav Petkov <borislav.petkov@amd.com>

mm/memory_failure: let the compiler add the function name

These things tend to get out of sync with time so let the compiler
automatically enter the current function name using __func__.

No functional change.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0815f3d8 03-Apr-2012 Michal Nazarewicz <mina86@mina86.com>

mm: page_isolation: MIGRATE_CMA isolation functions added

This commit changes various functions that change pages and
pageblocks migrate type between MIGRATE_ISOLATE and
MIGRATE_MOVABLE in such a way as to allow to work with
MIGRATE_CMA migrate type.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 385de357 21-Mar-2012 Dean Nelson <dnelson@redhat.com>

thp: allow a hwpoisoned head page to be put back to LRU

Andrea Arcangeli pointed out to me that a check in __memory_failure()
which was intended to prevent THP tail pages from being checked for the
absence of the PG_lru flag (something that is always the case), was also
preventing THP head pages from being checked.

A THP head page could actually benefit from the call to shake_page() by
ending up being put back to a LRU, provided it had been waiting in a
pagevec array.

Andrea suggested that the "!PageTransCompound(p)" in the if-statement
should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
to be checked and possibly shaken.

Signed-off-by: Dean Nelson <dnelson@redhat.com>
Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6bc32b8 12-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: introduce sync-light migration for use by compaction

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7329bbeb 13-Dec-2011 Tony Luck <tony.luck@intel.com>

HWPOISON: Add code to handle "action required" errors.

Add new flag bit "MF_ACTION_REQUIRED" to be used by machine check
code to force a signal with si_code = BUS_MCEERR_AR in the case
where the error occurs in processor execution context. Pass the
flags argument along call chain:
memory_failure()
hwpoison_user_mappings()
kill_procs()
kill_proc()

Drop the "_ao" suffix from kill_procs_ao() and kill_proc_ao() since
they can now handle "action required" as well as "action optional" errors.

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>


# cd42f4a3 15-Dec-2011 Tony Luck <tony.luck@intel.com>

HWPOISON: Clean up memory_failure() vs. __memory_failure()

There is only one caller of memory_failure(), all other users call
__memory_failure() and pass in the flags argument explicitly. The
lone user of memory_failure() will soon need to pass flags too.

Add flags argument to the callsite in mce.c. Delete the old memory_failure()
function, and then rename __memory_failure() without the leading "__".

Provide clearer message when action optional memory errors are ignored.

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>


# dd73e85f 31-Oct-2011 Dean Nelson <dnelson@redhat.com>

HWPOISON: convert pr_debug()s to pr_info()s

Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info) authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.

About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage"). And these pr_debug()s failed to get
converted to pr_info()s.

This patch converts them as well. And does some minor related whitespace
cleanup.

Signed-off-by: Dean Nelson <dnelson@redhat.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9e15baf 26-May-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

mm: Add export.h for EXPORT_SYMBOL to active symbol exporters

These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# ea8f5fb8 12-Jul-2011 Huang Ying <ying.huang@intel.com>

HWPoison: add memory_failure_queue()

memory_failure() is the entry point for HWPoison memory error
recovery. It must be called in process context. But commonly
hardware memory errors are notified via MCE or NMI, so some delayed
execution mechanism must be used. In MCE handler, a work queue + ring
buffer mechanism is used.

In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
(Generic Hardware Error Source) can be used to report memory errors
too. To add support to APEI GHES memory recovery, a mechanism similar
to that of MCE is implemented. memory_failure_queue() is the new
entry point that can be called in IRQ context. The next step is to
make MCE handler uses this interface too.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>


# 9b679320 27-Jun-2011 Peter Zijlstra <peterz@infradead.org>

mm/memory-failure.c: fix spinlock vs mutex order

We cannot take a mutex while holding a spinlock, so flip the order and
fix the locking documentation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5db8a73a 15-Jun-2011 Minchan Kim <minchan.kim@gmail.com>

mm/memory-failure.c: fix page isolated count mismatch

Pages isolated for migration are accounted with the vmstat counters
NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to
increment these counters when pages are isolated from the LRU. Once the
pages have been migrated, they are put back on the LRU or freed and the
isolated count is decremented.

Memory failure is not properly accounting for pages it isolates causing
the NR_ISOLATED counters to be negative. On SMP builds, this goes
unnoticed as negative counters are treated as 0 due to expected per-cpu
drift. On UP builds, the counter is treated by too_many_isolated() as a
large value causing processes to enter D state during page reclaim or
compaction. This patch accounts for pages isolated by memory failure
correctly.

[mel@csn.ul.ie: rewrote changelog]
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1495f230 24-May-2011 Ying Han <yinghan@google.com>

vmscan: change shrinker API by passing shrink_control struct

Change each shrinker's API by consolidating the existing parameters into
shrink_control struct. This will simplify any further features added w/o
touching each file of shrinker.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a09ed5e0 24-May-2011 Ying Han <yinghan@google.com>

vmscan: change shrink_slab() interfaces by passing shrink_control

Consolidate the existing parameters to shrink_slab() into a new
shrink_control struct. This is needed later to pass the same struct to
shrinkers.

Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd486285 24-May-2011 Konstantin Khlebnikov <khlebnikov@openvz.org>

mem-hwpoison: fix page refcount around isolate_lru_page()

Drop first page reference only after calling isolate_lru_page() to keep
page stable reference while isolating.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d48ae45 24-May-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: Convert i_mmap_lock to a mutex

Straightforward conversion of i_mmap_lock to a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# e64a782f 22-Mar-2011 Minchan Kim <minchan.kim@gmail.com>

mm: change __remove_from_page_cache()

Now we renamed remove_from_page_cache with delete_from_page_cache. As
consistency of __remove_from_swap_cache and remove_from_swap_cache, we
change internal page cache handling function name, too.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f58c9df7 29-Jan-2011 Huang Ying <ying.huang@intel.com>

mm: remove is_hwpoison_address

Unused.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 7eaceacc 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# af241a08 01-Feb-2011 Jin Dongming <jin.dongming@np.css.fujitsu.com>

thp: fix unsuitable behavior for hwpoisoned tail page

When a tail page of THP is poisoned, memory-failure will do nothing except
setting PG_hwpoison, while the expected behavior is that the process, who
is using the poisoned tail page, should be killed.

The above problem is caused by lru check of the poisoned tail page of THP.
Because PG_lru flag is only set on the head page of THP, the check always
consider the poisoned tail page as NON lru page.

So the lru check for the tail page of THP should be avoided, as like as
hugetlb.

This patch adds !PageTransCompound() before lru check for THP, because of
the check (!PageHuge() && !PageTransCompound()) the whole branch could be
optimized away at build time when both hugetlbfs and THP are set with "N"
(or in archs not supporting either of those).

[akpm@linux-foundation.org: fix unrelated typo in shake_page() comment]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6d30ddd 01-Feb-2011 Jin Dongming <jin.dongming@np.css.fujitsu.com>

thp: fix the wrong reported address of hwpoisoned hugepages

When the tail page of THP is poisoned, the head page will be poisoned too.
And the wrong address, address of head page, will be sent with sigbus
always.

So when the poisoned page is used by Guest OS which is running on KVM,
after the address changing(hva->gpa) by qemu, the unexpected process on
Guest OS will be killed by sigbus.

What we expected is that the process using the poisoned tail page could be
killed on Guest OS, but not that the process using the healthy head page
is killed.

Since it is not good to poison the healthy page, avoid poisoning other
than the page which is really poisoned.
(While we poison all pages in a huge page in case of hugetlb,
we can do this for THP thanks to split_huge_page().)

Here we fix two parts:
1. Isolate the poisoned page only to make sure
the reported address is the address of poisoned page.
2. make the poisoned page work as the poisoned regular page.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# efeda7a4 01-Feb-2011 Jin Dongming <jin.dongming@np.css.fujitsu.com>

thp: fix splitting of hwpoisoned hugepages

The poisoned THP is now split with split_huge_page() in
collect_procs_anon(). If kmalloc() is failed in collect_procs(),
split_huge_page() could not be called. And the work after
split_huge_page() for collecting the processes using poisoned page will
not be done, too. So the processes using the poisoned page could not be
killed.

The condition becomes worse when CONFIG_DEBUG_VM == "Y". Because the
poisoned THP could not be split, system panic will be caused by
VM_BUG_ON(PageTransHuge(page)) in try_to_unmap().

This patch does:
1. move split_huge_page() to the place before collect_procs().
This can be sure the failure of splitting THP is caused by itself.
2. when splitting THP is failed, stop the operations after it.
This can avoid unexpected system panic or non sense works.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48db54ee 01-Feb-2011 Minchan Kim <minchan.kim@gmail.com>

mm/migration: fix page corruption during hugepage migration

If migrate_huge_page by memory-failure fails , it calls put_page in itself
to decrease page reference and caller of migrate_huge_page also calls
putback_lru_pages. It can do double free of page so it can make page
corruption on page holder.

In addtion, clean of pages on caller is consistent behavior with
migrate_pages by cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
counting").

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57fc4a5e 01-Feb-2011 Andrea Arcangeli <aarcange@redhat.com>

mm: when migrate_pages returns 0, all pages must have been released

In some cases migrate_pages could return zero while still leaving a few
pages in the pagelist (and some caller wouldn't notice it has to call
putback_lru_pages after commit cf608ac19c9 ("mm: compaction: fix
COMPACTPAGEFAILED counting")).

Add one missing putback_lru_pages not added by commit cf608ac19c95 ("mm:
compaction: fix COMPACTPAGEFAILED counting").

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37c2ac78 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: compound_trans_order

Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91600e9e 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: fix memory-failure hugetlbfs vs THP collision

hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
pages and that broke THP as split_huge_page was then called on hugetlbfs
pages too.

compound_head/order was also run unsafe on THP pages that can be splitted
at any time.

All compound_head() invocations in memory-failure.c that are run on pages
that aren't pinned and that can be freed and reused from under us (while
compound_head is running) are buggy because compound_head can return a
dangling pointer, but I'm not fixing this as this is a generic
memory-failure bug not specific to THP but it applies to hugetlbfs too, so
I can fix it later after THP is merged upstream.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f04f62f 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: split_huge_page paging

Paging logic that splits the page before it is unmapped and added to swap
to ensure backwards compatibility with the legacy swap code. Eventually
swap should natively pageout the hugepages to increase performance and
decrease seeking and fragmentation of swap space. swapoff can just skip
over huge pmd as they cannot be part of swap yet. In add_to_swap be
careful to split the page only if we got a valid swap entry so we don't
split hugepages with a full swap.

In theory we could split pages before isolating them during the lru scan,
but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
or PG_lock taken, so split_huge_page has to run either with mmap_sem
read/write mode or PG_lock taken. Calling it from isolate_lru_page would
make locking more complicated, in addition to that split_huge_page would
deadlock if called by __isolate_lru_page because it has to take the lru
lock to add the tail pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77f1fe6b 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path

Migration synchronously waits for writeback if the initial passes fails.
Callers of memory compaction do not necessarily want this behaviour if the
caller is latency sensitive or expects that synchronous migration is not
going to have a significantly better success rate.

This patch adds a sync parameter to migrate_pages() allowing the caller to
indicate if wait_on_page_writeback() is allowed within migration or not.
For reclaim/compaction, try_to_compact_pages() is first called
asynchronously, direct reclaim runs and then try_to_compact_pages() is
called synchronously as there is a greater expectation that it'll succeed.

[akpm@linux-foundation.org: build/merge fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20d6c96b 02-Dec-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mem-hotplug: introduce {un}lock_memory_hotplug()

Presently hwpoison is using lock_system_sleep() to prevent a race with
memory hotplug. However lock_system_sleep() is a no-op if
CONFIG_HIBERNATION=n. Therefore we need a new lock.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf608ac1 26-Oct-2010 Minchan Kim <minchan.kim@gmail.com>

mm: compaction: fix COMPACTPAGEFAILED counting

Presently update_nr_listpages() doesn't have a role. That's because lists
passed is always empty just after calling migrate_pages. The
migrate_pages cleans up page list which have failed to migrate before
returning by aaa994b3.

[PATCH] page migration: handle freeing of pages in migrate_pages()

Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().

At that time, we thought we don't need any postprocessing of pages. But
the situation is changed. The compaction need to know the number of
failed to migrate for COMPACTPAGEFAILED stat

This patch makes new rule for caller of migrate_pages to call
putback_lru_pages. So caller need to clean up the lists so it has a
chance to postprocess the pages. [suggested by Christoph Lameter]

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a08c80eb 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Remove retry loop for try_to_unmap

We don't reply in other temporary failure cases and there were no
reports of replies happening. I think the original reason it was
added was also just an early bug, not an observation of the race.

So remove the loop for now, but keep a warning message.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 9033ae16 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Turn addr_valid from bitfield into char

The addr_valid flag is the only flag in "to_kill" and it's slightly more
efficient to have it as char instead of a bitfield.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 898e70d1 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Disable DEBUG by default

Now that only a few obscure messages are left as pr_debug disable
outputting of pr_debug in memory-failure.c by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# fb46e735 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Convert pr_debugs to pr_info

Convert a lot of pr_debugs in memory-failure.c that are generally useful
to pr_info. It's reasonable to print at least one message why
offlining succeeded or failed by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 1c80b990 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Improve comments in memory-failure.c

Clean up and improve the overview comment in memory-failure.c

Tidy some grammar issues in other comments.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 6a90181c 07-Sep-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: fix unpoison for hugepage

Currently unpoisoning hugepages doesn't work correctly because
clearing PG_HWPoison is done outside if (TestClearPageHWPoison).
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# d950b958 07-Sep-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: soft offlining for hugepage

This patch extends soft offlining framework to support hugepage.
When memory corrected errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating data onto another hugepage
and disabling the original (maybe half-broken) one.

ChangeLog since v4:
- branch soft_offline_page() for hugepage

ChangeLog since v3:
- remove comment about "ToDo: hugepage soft-offline"

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 8c6c2ecb 07-Sep-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED

Currently error recovery for free hugepage works only for MF_COUNT_INCREASED.
This patch enables !MF_COUNT_INCREASED case.

Free hugepages can be handled directly by alloc_huge_page() and
dequeue_hwpoisoned_huge_page(), and both of them are protected
by hugetlb_lock, so there is no race between them.

Note that this patch defines the refcount of HWPoisoned hugepage
dequeued from freelist is 1, deviated from present 0, thereby we
can avoid race between unpoison and memory failure on free hugepage.
This is reasonable because unlikely to free buddy pages, free hugepage
is governed by hugetlbfs even after error handling finishes.
And it also makes unpoison code added in the later patch cleaner.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 6de2b1aa 07-Sep-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()

This check is necessary to avoid race between dequeue and allocation,
which can cause a free hugepage to be dequeued twice and get kernel unstable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 47f43e7e 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Stop shrinking at right page count

When we call the slab shrinker to free a page we need to stop at
page count one because the caller always holds a single reference, not zero.

This avoids useless looping over slab shrinkers and freeing too much
memory.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 0d9ee6a2 27-Sep-2010 Andi Kleen <ak@linux.intel.com>

HWPOISON: Report correct address granuality for AO huge page errors

The SIGBUS user space signalling is supposed to report the
address granuality of a corruption. Pass this information correctly
for huge pages by querying the hpage order.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 93f70f90 27-May-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: isolate corrupted hugepage

If error hugepage is not in-use, we can fully recovery from error
by dequeuing it from freelist, so return RECOVERY.
Otherwise whether or not we can recovery depends on user processes,
so return DELAYED.

Dependency:
"HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# c9fbdd5f 27-May-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error

For now all pages in the error hugepage are considered as hwpoisoned,
so count all of them in mce_bad_pages.

Dependency:
"HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 7013febc 27-May-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage

To avoid race condition between concurrent memory errors on identified
hugepage, we atomically test and set PG_hwpoison bit on the head page.
All pages in the error hugepage are considered as hwpoisoned
for now, so set and clear all PG_hwpoison bits in the hugepage
with page lock of the head page held.

Dependency:
"HWPOISON, hugetlb: enable error handling path for hugepage"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 7af446a8 27-May-2010 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

HWPOISON, hugetlb: enable error handling path for hugepage

This patch just enables handling path. Real containing and
recovering operation will be implemented in following patches.

Dependency:
"hugetlb, rmap: add reverse mapping for hugepage."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# bbeb3406 22-Jun-2010 Huang Ying <ying.huang@intel.com>

KVM: Fix a race condition for usage of is_hwpoison_address()

is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.

Comment is_hwpoison_address to remind other users.

Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# bf998156 31-May-2010 Huang Ying <ying.huang@intel.com>

KVM: Avoid killing userspace through guest SRAO MCE on unmapped pages

In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.

But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.

The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.

To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.

[xiao: fix warning introduced by avi]

Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 5beb4930 05-Mar-2010 Rik van Riel <riel@redhat.com>

mm: change anon_vma linking to fix multi-process server scalability issue

The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27df5068 21-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Add PROC_FS dependency to hwpoison injector v2

The injector filter requires stable_page_flags() which is supplied
by procfs. So make it dependent on that.

Also add ifdefs around the filter code in memory-failure.c so that
when the filter is disabled due to missing dependencies the whole
code still builds.

Reported-by: Ingo Molnar
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# f2c03deb 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Remove stray phrase in a comment

Better to have complete sentences.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 12686d15 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Try to allocate migration page on the same node

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# facb6011 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Add soft page offline support

This is a simpler, gentler variant of memory_failure() for soft page
offlining controlled from user space. It doesn't kill anything, just
tries to invalidate and if that doesn't work migrate the
page away.

This is useful for predictive failure analysis, where a page has
a high rate of corrected errors, but hasn't gone bad yet. Instead
it can be offlined early and avoided.

The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.

We use the page isolate facility to prevent re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory allocation I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer currently, but the simplest solution
currently.

When the page is not free or LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could be potentially
improved later.

Thanks to Fengguang Wu and Haicheng Li for some fixes.

[Added fix from Andrew Morton to adapt to new migrate_pages prototype]
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 2326c467 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Undefine short-hand macros after use to avoid namespace conflict

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 0474a60e 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Use new shake_page in memory_failure

shake_page handles more types of page caches than
the much simpler lru_add_drain_all:

- slab (quite inefficiently for now)
- any other caches with a shrinker callback
- per cpu page allocator pages
- per CPU LRU

Use this call to try to turn pages into free or LRU pages.
Then handle the case of the page becoming free after drain everything.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 1bfe5feb 15-Dec-2009 Haicheng Li <haicheng.li@linux.intel.com>

HWPOISON: add an interface to switch off/on all the page filters

In some use cases, user doesn't need extra filtering. E.g. user program
can inject errors through madvise syscall to its own pages, however it
might not know what the page state exactly is or which inode the page
belongs to.

So introduce an one-off interface "corrupt-filter-enable".

Echo 0 to switch off page filters, and echo 1 to switch on the filters.
[AK: changed default to 0]

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 4fd466eb 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: add memory cgroup filter

The hwpoison test suite need to inject hwpoison to a collection of
selected task pages, and must not touch pages not owned by them and
thus kill important system processes such as init. (But it's OK to
mis-hwpoison free/unowned pages as well as shared clean pages.
Mis-hwpoison of shared dirty pages will kill all tasks, so the test
suite will target all or non of such tasks in the first place.)

The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, and tell the hwpoison
injection code to only kill pages associated with some active memory
cgroup.

The prerequisite for doing hwpoison stress tests with mem_cgroup is,
the mem_cgroup code tracks task pages _accurately_ (unless page is
locked). Which we believe is/should be true.

The benefits are simplification of hwpoison injector code. Also the
mem_cgroup code will automatically be tested by hwpoison test cases.

The alternative interfaces pin-pfn/unpin-pfn can also delegate the
(process and page flags) filtering functions reliably to user space.
However prototype implementation shows that this scheme adds more
complexity than we wanted.

Example test case:

mkdir /cgroup/hwpoison

usemem -m 100 -s 1000 &
echo `jobs -p` > /cgroup/hwpoison/tasks

memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

page-types -p `pidof init` --hwpoison # shall do nothing
page-types -p `pidof usemem` --hwpoison # poison its pages

[AK: Fix documentation]
[Add fix for problem noticed by Li Zefan <lizf@cn.fujitsu.com>;
dentry in the css could be NULL]

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Paul Menage <menage@google.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 478c5ffc 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: add page flags filter

When specified, only poison pages if ((page_flags & mask) == value).

- corrupt-filter-flags-mask
- corrupt-filter-flags-value

This allows stress testing of many kinds of pages.

Strictly speaking, the buddy pages requires taking zone lock, to avoid
setting PG_hwpoison on a "was buddy but now allocated to someone" page.
However we can just do nothing because we set PG_locked in the beginning,
this prevents the page allocator from allocating it to someone. (It will
BUG() on the unexpected PG_locked, which is fine for hwpoison testing.)

[AK: Add select PROC_PAGE_MONITOR to satisfy dependency]

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 7c116f2b 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: add fs/device filters

Filesystem data/metadata present the most tricky-to-isolate pages.
It requires careful code review and stress testing to get them right.

The fs/device filter helps to target the stress tests to some specific
filesystem pages. The filter condition is block device's major/minor
numbers:
- corrupt-filter-dev-major
- corrupt-filter-dev-minor
When specified (non -1), only page cache pages that belong to that
device will be poisoned.

The filters are checked reliably on the locked and refcounted page.

Haicheng: clear PG_hwpoison and drop bad page count if filter not OK
AK: Add documentation

CC: Haicheng Li <haicheng.li@intel.com>
CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 138ce286 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: return 0 to indicate success reliably

Return 0 to indicate success, when
- action result is RECOVERED or DELAYED
- no extra page reference

Note that dirty swapcache pages are kept in swapcache, so can have one
more reference count.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# d95ea51e 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: make semantics of IGNORED/DELAYED clear

Change semantics for
- IGNORED: not handled; it may well be _unsafe_
- DELAYED: to be handled later; it is _safe_

With this change,
- IGNORED/FAILED mean (maybe) Error
- DELAYED/RECOVERED mean Success

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 847ce401 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: Add unpoisoning support

The unpoisoning interface is useful for stress testing tools to
reclaim poisoned pages (to prevent OOM)

There is no hardware level unpoisioning, so this
cannot be used for real memory errors, only for software injected errors.

Note that it may leak pages silently - those who have been removed from
LRU cache, but not isolated from page cache/swap cache at hwpoison time.
Especially the stress test of dirty swap cache pages shall reboot system
before exhausting memory.

AK: Fix comments, add documentation, add printks, rename symbol

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 8d22ba1b 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: detect free buddy pages explicitly

Most free pages in the buddy system have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.

CC: Nick Piggin <npiggin@suse.de>
CC: Mel Gorman <mel@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 95d01fc6 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: remove the free buddy page handler

The buddy page has already be handled in the very beginning.
So remove redundant code.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# dc2a1cbf 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: introduce delete_from_lru_cache()

Introduce delete_from_lru_cache() to
- clear PG_active, PG_unevictable to avoid complains at unpoison time
- move the isolate_lru_page() call back to the handlers instead of the
entrance of __memory_failure(), this is more hwpoison filter friendly

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# db0480b3 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: comment the possible set_page_dirty() race

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 1668bfd5 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: abort on failed unmap

Don't try to isolate a still mapped page. Otherwise we will hit the
BUG_ON(page_mapped(page)) in __remove_from_page_cache().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 82ba011b 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Turn ref argument into flags argument

Now that "ref" is just a boolean turn it into
a flags argument. First step is only a single flag
that makes the code's intention more clear, but more
may follow.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# bd1ce5f9 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: avoid grabbing the page count multiple times during madvise injection

If page is double referenced in madvise_hwpoison() and __memory_failure(),
remove_mapping() will fail because it expects page_count=2. Fix it by
not grabbing extra page count in __memory_failure().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# a7560fc8 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: return ENXIO on invalid page number

Use a different errno than the usual EIO for invalid page numbers.
This is mainly for better reporting for the injector.

This also avoids calling action_result() with invalid pfn.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 9b9a29ec 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: remove the anonymous entry

(PG_swapbacked && !PG_lru) pages should not happen.
Better to treat them as unknown pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 588f9ce6 15-Dec-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: Be more aggressive at freeing non LRU caches

shake_page handles more types of page caches than lru_drain_all()

- per cpu page allocator pages
- per CPU LRU

Stops early when the page became free.

Used in followon patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# af8e3354 14-Dec-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: CONFIG_MMU for PG_mlocked

Remove three degrees of obfuscation, left over from when we had
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
built when CONFIG_MMU, so don't need such conditions at all.

Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
169 defconfigs: leave those to evolve in due course.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af901ca1 14-Nov-2009 André Goddard Rosa <andre.goddard@gmail.com>

tree-wide: fix assorted typos all over the place

That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 92f7ba70 26-Oct-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

hwpoison: fix oops on ksm pages

Memory failure on a KSM page currently oopses on its NULL anon_vma in
page_lock_anon_vma(): that may not be much worse than the consequence of
ignoring it, but it is better to be consistent with how ZERO_PAGE and
hugetlb pages and other awkward cases are treated. Just skip it.

We could fix it for 2.6.32 at the KSM end, by putting a dummy anon_vma
pointer in there; but that would get harder next time, when KSM will put a
pointer to something else there (and I'm not currently planning to do any
work to open that up to memory_failure). So I would prefer this simple
PageKsm test, until the other exceptions are handled.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7456b040 19-Oct-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: fix invalid page count in printk output

The madvise injector already holds a reference when passing in a page
to the memory-failure code. The code corrects for this additional reference
for its checks, but the final printk output didn't. Fix that.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 01e00f88 13-Oct-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

HWPOISON: fix oops on ksm pages

Memory failure on a KSM page currently oopses on its NULL anon_vma in
page_lock_anon_vma(): that may not be much worse than the consequence
of ignoring it, but it is better to be consistent with how ZERO_PAGE
and hugetlb pages and other awkward cases are treated. Just skip it.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# e43c3afb 28-Sep-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: return early on non-LRU pages

Right now we have some trouble with non atomic access
to page flags when locking the page. To plug this hole
for now, limit error recovery to LRU pages for now.

This could be better fixed by defining a suitable protocol,
but let's go this simple way for now

This avoids unnecessary races with __set_page_locked() and
__SetPageSlab*() and maybe more non-atomic page flag operations.

This loses isolated pages which are currently in page reclaim, but these
are relatively limited compared to the total memory.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[AK: new description, bug fixes, cleanups]


# 6a46079c 16-Sep-2009 Andi Kleen <andi@firstfloor.org>

HWPOISON: The high level memory error handler in the VM v7

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>