History log of /linux-master/mm/page_alloc.c
Revision Date Author Comments
# 72741db6 28-Feb-2024 Thorsten Blum <thorsten.blum@toblux.com>

mm: page_alloc: use div64_ul() instead of do_div()

Fixes Coccinelle/coccicheck warning reported by do_div.cocci.

Compared to do_div(), div64_ul() does not implicitly cast the divisor and
does not unnecessarily calculate the remainder.

Link: https://lkml.kernel.org/r/20240228224911.1164-2-thorsten.blum@toblux.com
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c8b36003 27-Feb-2024 Richard Chang <richardycc@google.com>

mm: add alloc_contig_migrate_range allocation statistics

alloc_contig_migrate_range has every information to be able to understand
big contiguous allocation latency. For example, how many pages are
migrated, how many times they were needed to unmap from page tables.

This patch adds the trace event to collect the allocation statistics. In
the field, it was quite useful to understand CMA allocation latency.

[akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com
Signed-off-by: Richard Chang <richardycc@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org.
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8b7b0a5e 27-Feb-2024 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove free_unref_page_list()

All callers now use free_unref_folios() so we can delete this function.

Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 31b2ff82 27-Feb-2024 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: handle large folios in free_unref_folios()

Call folio_undo_large_rmappable() if needed. free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.

Link: https://lkml.kernel.org/r/20240227174254.710559-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 90491d87 27-Feb-2024 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: add free_unref_folios()

Iterate over a folio_batch rather than a linked list. This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it. Again, this lowers the maximum lock hold time from
32 folios to 15, but I do not expect this to have a significant effect.

Link: https://lkml.kernel.org/r/20240227174254.710559-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7c76d922 27-Feb-2024 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert free_unref_page_list() to use folios

Most of its callees are not yet ready to accept a folio, but we know all
of the pages passed in are actually folios because they're linked through
->lru.

Link: https://lkml.kernel.org/r/20240227174254.710559-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 46d44d09 26-Feb-2024 Zi Yan <ziy@nvidia.com>

mm: page_owner: add support for splitting to any order in split page_owner

It adds a new_order parameter to set new page order in page owner. It
prepares for upcoming changes to support split huge page to any lower
order.

Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b8791381 26-Feb-2024 Zi Yan <ziy@nvidia.com>

mm: memcg: make memcg huge page split support any order split

It sets memcg information for the pages after the split. A new parameter
new_order is added to tell the order of subpages in the new page, always 0
for now. It prepares for upcoming changes to support split huge page to
any lower order.

Link: https://lkml.kernel.org/r/20240226205534.1603748-6-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9a581c12 26-Feb-2024 Zi Yan <ziy@nvidia.com>

mm/page_owner: use order instead of nr in split_page_owner()

We do not have non power of two pages, using nr is error prone if nr is
not power-of-two. Use page order instead.

Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 502003bb 26-Feb-2024 Zi Yan <ziy@nvidia.com>

mm/memcg: use order instead of nr in split_page_memcg()

We do not have non power of two pages, using nr is error prone if nr is
not power-of-two. Use page order instead.

Link: https://lkml.kernel.org/r/20240226205534.1603748-4-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 77c7a095 22-Feb-2024 Hao Ge <gehao@kylinos.cn>

mm/page_alloc: make check_new_page() return bool

Make check_new_page() return bool like check_new_pages()

Link: https://lkml.kernel.org/r/20240222091932.54799-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5bb14214 21-Feb-2024 Hao Ge <gehao@kylinos.cn>

mm/page_alloc: make bad_range() return bool

bad_range() can return bool, so let us change it.

Link: https://lkml.kernel.org/r/20240221073227.276234-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 733aea0b 20-Feb-2024 Zi Yan <ziy@nvidia.com>

mm/compaction: add support for >0 order folio memory compaction.

Before last commit, memory compaction only migrates order-0 folios and
skips >0 order folios. Last commit splits all >0 order folios during
compaction. This commit migrates >0 order folios during compaction by
keeping isolated free pages at their original size without splitting them
into order-0 pages and using them directly during migration process.

What is different from the prior implementation:
1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page
lists, where each page list stores free pages in the same order.
2. All free pages are not post_alloc_hook() processed nor buddy pages,
although their orders are stored in first page's private like buddy
pages.
3. During migration, in new page allocation time (i.e., in
compaction_alloc()), free pages are then processed by post_alloc_hook().
When migration fails and a new page is returned (i.e., in
compaction_free()), free pages are restored by reversing the
post_alloc_hook() operations using newly added
free_pages_prepare_fpi_none().

Step 3 is done for a latter optimization that splitting and/or merging
free pages during compaction becomes easier.

Note: without splitting free pages, compaction can end prematurely due to
migration will return -ENOMEM even if there is free pages. This happens
when no order-0 free page exist and compaction_alloc() return NULL.

Link: https://lkml.kernel.org/r/20240220183220.1451315-4-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5267fe5d0 20-Feb-2024 Zi Yan <ziy@nvidia.com>

mm/page_alloc: remove unused fpi_flags in free_pages_prepare()

Patch series "Enable >0 order folio memory compaction", v7.

This patchset enables >0 order folio memory compaction, which is one of
the prerequisitions for large folio support[1].

I am aware of that split free pages is necessary for folio migration in
compaction, since if >0 order free pages are never split and no order-0
free page is scanned, compaction will end prematurely due to migration
returns -ENOMEM. Free page split becomes a must instead of an
optimization.

lkp ncompare results (on a 8-CPU (Intel Xeon E5-2650 v4 @2.20GHz) 16G VM)
for default LRU (-no-mglru) and CONFIG_LRU_GEN are shown at the bottom,
copied from V3[4]. In sum, most of vm-scalability applications do not see
performance change, and the others see ~4% to ~26% performance boost under
default LRU and ~2% to ~6% performance boost under CONFIG_LRU_GEN.

Overview
===

To support >0 order folio compaction, the patchset changes how free pages
used for migration are kept during compaction. Free pages used to be
split into order-0 pages that are post allocation processed (i.e.,
PageBuddy flag cleared, page order stored in page->private is zeroed, and
page reference is set to 1). Now all free pages are kept in a
NR_PAGE_ORDER array of page lists based on their order without post
allocation process. When migrate_pages() asks for a new page, one of the
free pages, based on the requested page order, is then processed and given
out. And THP <2MB would need this feature.


[1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
[2] https://lore.kernel.org/linux-mm/20231113170157.280181-1-zi.yan@sent.com/
[3] https://lore.kernel.org/linux-mm/20240123034636.1095672-1-zi.yan@sent.com/
[4] https://lore.kernel.org/linux-mm/20240202161554.565023-1-zi.yan@sent.com/
[5] https://lore.kernel.org/linux-mm/20240212163510.859822-1-zi.yan@sent.com/
[6] https://lore.kernel.org/linux-mm/20240214220420.1229173-1-zi.yan@sent.com/
[7] https://lore.kernel.org/linux-mm/20240216170432.1268753-1-zi.yan@sent.com/


This patch (of 4):

Commit 0a54864f8dfb ("kasan: remove PG_skip_kasan_poison flag") removes
the use of fpi_flags in should_skip_kasan_poison() and fpi_flags is only
passed to should_skip_kasan_poison() in free_pages_prepare(). Remove the
unused parameter.

Link: https://lkml.kernel.org/r/20240220183220.1451315-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20240220183220.1451315-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5cec4eb7 26-Jan-2024 Huang Ying <ying.huang@intel.com>

mm and cache_info: remove unnecessary CPU cache info update

For each CPU hotplug event, we will update per-CPU data slice size and
corresponding PCP configuration for every online CPU to make the
implementation simple. But, Kyle reported that this takes tens seconds
during boot on a machine with 34 zones and 3840 CPUs.

So, in this patch, for each CPU hotplug event, we only update per-CPU data
slice size and corresponding PCP configuration for the CPUs that share
caches with the hotplugged CPU. With the patch, the system boot time
reduces 67 seconds on the machine.

Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com
Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Originally-by: Kyle Meyer <kyle.meyer@hpe.com>
Reported-and-tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a0727489 28-Feb-2024 Yunsheng Lin <linyunsheng@huawei.com>

net: introduce page_frag_cache_drain()

When draining a page_frag_cache, most user are doing
the similar steps, so introduce an API to avoid code
duplication.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 4bc0d63a 28-Feb-2024 Yunsheng Lin <linyunsheng@huawei.com>

page_frag: unify gfp bits for order 3 page allocation

Currently there seems to be three page frag implementations
which all try to allocate order 3 page, if that fails, it
then fail back to allocate order 0 page, and each of them
all allow order 3 page allocation to fail under certain
condition by using specific gfp bits.

The gfp bits for order 3 page allocation are different
between different implementation, __GFP_NOMEMALLOC is
or'd to forbid access to emergency reserves memory for
__page_frag_cache_refill(), but it is not or'd in other
implementions, __GFP_DIRECT_RECLAIM is masked off to avoid
direct reclaim in vhost_net_page_frag_refill(), but it is
not masked off in __page_frag_cache_refill().

This patch unifies the gfp bits used between different
implementions by or'ing __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
possible pressure for mm.

Leave the gfp unifying for page frag implementation in sock.c
for now as suggested by Paolo Abeni.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 411c5f36 28-Feb-2024 Yunsheng Lin <linyunsheng@huawei.com>

mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument

napi_alloc_frag_align() and netdev_alloc_frag_align() accept
align as an argument, and they are thin wrappers around the
__napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
doing the alignment checking and align mask conversion, in order
to call page_frag_alloc_align() directly. The intention here is
to keep the alignment checking and the alignmask conversion in
in-line wrapper to avoid those kind of operations during execution
time since it can usually be handled during compile time.

We are going to use page_frag_alloc_align() in vhost_net.c, it
need the same kind of alignment checking and alignmask conversion,
so split up page_frag_alloc_align into an inline wrapper doing the
above operation, and add __page_frag_alloc_align() which is passed
with the align mask the original function expected as suggested by
Alexander.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 803de900 20-Feb-2024 Vlastimil Babka <vbabka@suse.cz>

mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations

Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
__GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination
can happen in a suspend/resume context where a GFP_KERNEL allocation can
have __GFP_IO masked out via gfp_allowed_mask.

Quoting Sven:

1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
with __GFP_RETRY_MAYFAIL set.

2. page alloc's __alloc_pages_slowpath tries to get a page from the
freelist. This fails because there is nothing free of that costly
order.

3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
which bails out because a zone is ready to be compacted; it pretends
to have made a single page of progress.

4. page alloc tries to compact, but this always bails out early because
__GFP_IO is not set (it's not passed by the snd allocator, and even
if it were, we are suspending so the __GFP_IO flag would be cleared
anyway).

5. page alloc believes reclaim progress was made (because of the
pretense in item 3) and so it checks whether it should retry
compaction. The compaction retry logic thinks it should try again,
because:
a) reclaim is needed because of the early bail-out in item 4
b) a zonelist is suitable for compaction

6. goto 2. indefinite stall.

(end quote)

The immediate root cause is confusing the COMPACT_SKIPPED returned from
__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
indicating a lack of order-0 pages, and in step 5 evaluating that in
should_compact_retry() as a reason to retry, before incrementing and
limiting the number of retries. There are however other places that
wrongly assume that compaction can happen while we lack __GFP_IO.

To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
evaluation and switch the open-coded test in try_to_compact_pages() to use
it.

Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so
there's at least one attempt to actually reclaim, even if chances are
small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim()
return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact,
which is then used to avoid retrying reclaim/compaction for costly
allocations (step 5) if we can't compact and also to skip the early
compaction attempt that we do in some cases

Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sven van Ashbrook <svenva@chromium.org>
Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/
Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Curtis Malainey <cujomalainey@chromium.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dba1b8a7 24-Nov-2023 Jesper Dangaard Brouer <hawk@kernel.org>

mm/page_pool: catch page_pool memory leaks

Pages belonging to a page_pool (PP) instance must be freed through the
PP APIs in-order to correctly release any DMA mappings and release
refcnt on the DMA device when freeing PP instance. When PP release a
page (page_pool_release_page) the page->pp_magic value is cleared.

This patch detect a leaked PP page in free_page_is_bad() via
unexpected state of page->pp_magic value being PP_SIGNATURE.

We choose to report and treat it as a bad page. It would be possible
to release the page via returning it to the PP instance as the
page->pp pointer is likely still valid.

Notice this code is only activated when either compiled with
CONFIG_DEBUG_VM or boot cmdline debug_pagealloc=on, and
CONFIG_PAGE_POOL.

Reduced example output of leak with PP_SIGNATURE = dead000000000040:

BUG: Bad page state in process swapper/4 pfn:141fa6
page:000000006dbf8062 refcount:0 mapcount:0 mapping:0000000000000000 index:0x141fa6000 pfn:0x141fa6
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 002fffff80000000 dead000000000040 ffff88814888a000 0000000000000000
raw: 0000000141fa6000 0000000000000001 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
[...]
Call Trace:
<IRQ>
dump_stack_lvl+0x32/0x50
bad_page+0x70/0xf0
free_unref_page_prepare+0x263/0x430
free_unref_page+0x34/0x130
mlx5e_free_rx_mpwqe+0x190/0x1c0 [mlx5_core]
mlx5e_post_rx_mpwqes+0x1ac/0x280 [mlx5_core]
mlx5e_napi_poll+0x12b/0x710 [mlx5_core]
? skb_free_head+0x4f/0x90
__napi_poll+0x2b/0x1c0
net_rx_action+0x27b/0x360

The advantage is the Call Trace directly points to the function
leaking the PP page, which in this case is an on purpose bug
introduced into the mlx5 driver to test this code change.

Currently PP will periodically in page_pool_release_retry()
printk warning "stalled pool shutdown" which cannot be directly
corrolated to leaking and might as well be a false positive
due to SKBs being stuck on a socket for an extended period.
After this patch we should be able to remove this printk.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e0a760b 28-Dec-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fd377218 28-Dec-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, treewide: introduce NR_PAGE_ORDERS

NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.

NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.

[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5cb6674b 21-Dec-2023 Andrey Konovalov <andreyknvl@gmail.com>

mm, kasan: use KASAN_TAG_KERNEL instead of 0xff

Use the KASAN_TAG_KERNEL marco instead of open-coding 0xff in the mm code.
This macro is provided by include/linux/kasan-tags.h, which does not
include any other headers, so it's safe to include it into mm.h without
causing circular include dependencies.

Link: https://lkml.kernel.org/r/71db9087b0aebb6c4dccbc609cc0cd50621533c7.1703188911.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 250ae189 15-Dec-2023 Yajun Deng <yajun.deng@linux.dev>

mm: page_alloc: simplify __free_pages_ok()

There is redundant code in __free_pages_ok(). Use free_one_page()
simplify it.

Link: https://lkml.kernel.org/r/20231216030503.2126130-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ac3f3b0a 24-Nov-2023 Charan Teja Kalla <quic_charante@quicinc.com>

mm: page_alloc: unreserve highatomic page blocks before oom

__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is a progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before OOM kill path is taken.

should_reclaim_retry() checks the available(reclaimable + free pages)
memory against the min wmark levels of a zone and returns:

a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.

b) false, thus slowpath allocation takes oom kill path.

should_reclaim_retry() can also unreserves the high atomic reserves **but
only after all the reclaim retries are exhausted.**

In a case where there are almost none reclaimable memory and free pages
contains mostly the high atomic reserves but allocation context can't use
these high atomic reserves, makes the available memory below min wmark
levels hence false is returned from should_reclaim_retry() leading the
allocation request to take OOM kill path. This can turn into a early oom
kill if high atomic reserves are holding lot of free memory and
unreserving of them is not attempted.

(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB

Per above log, the free memory of ~7MB exist in the high atomic reserves
is not freed up before falling back to oom kill path.

Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback
to oom kill path.

Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9cd20f3f 24-Nov-2023 Charan Teja Kalla <quic_charante@quicinc.com>

mm: page_alloc: enforce minimum zone size to do high atomic reserves

Highatomic reserves are set to roughly 1% of zone for maximum and a
pageblock size for minimum. Encountered a system with the below
configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB

On such systems, even a single pageblock makes highatomic reserves are set
to ~8% of the zone memory. This high value can easily exert pressure on
the zone.

Per discussion with Michal and Mel, it is not much useful to reserve the
memory for highatomic allocations on such small systems[1]. Since the
minimum size for high atomic reserves is always going to be a pageblock
size and if 1% of zone managed pages is going to be below pageblock size,
don't reserve memory for high atomic allocations. Thanks Michal for this
suggestion[2].

Since no memory is being reserved for high atomic allocations and if
respective allocation failures are seen, this patch can be reverted.

[1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/

Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d68e39fc 24-Nov-2023 Charan Teja Kalla <quic_charante@quicinc.com>

mm: page_alloc: correct high atomic reserve calculations

Patch series "mm: page_alloc: fixes for high atomic reserve
caluculations", v3.

The state of the system where the issue exposed shown in oom kill logs:

[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB

From the above, it is seen that ~16MB of memory reserved for high atomic
reserves against the expectation of 1% reserves which is fixed in the 1st
patch.

Don't reserve the high atomic page blocks if 1% of zone memory size is
below a pageblock size.


This patch (of 2):

reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages
of a zone, which is used for the high order atomic allocations.

It uses the below calculation to reserve:
static void reserve_highatomic_pageblock(struct page *page, ....) {

.......
max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;

if (zone->nr_reserved_highatomic >= max_managed)
goto out;

zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);

out:
....
}

Since we are always appending the 1% of zone managed pages count to
pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the
nr_reserved_highatomic is incremented/decremented in pageblock sizes.

Encountered a system(actually a VM running on the Linux kernel) with the
below zone configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB

The existing calculations making it to reserve the 8MB(with pageblock size
of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount
of memory can easily exert memory pressure in the system thus may lead
into unnecessary reclaims till unreserving of high atomic reserves.

Since high atomic reserves are managed in pageblock size granules, as
MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for
high atomic reserves as, minimum is pageblock size , maximum is
approximately 1% of the zone managed pages.

Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 17b46e7b 08-Nov-2023 Brendan Jackman <jackmanb@google.com>

mm/page_alloc: dedupe some memcg uncharging logic

The duplication makes it seem like some work is required before uncharging
in the !PageHWPoison case. But it isn't, so we can simplify the code a
little.

Note the PageMemcgKmem check is redundant, but I've left it in as it
avoids an unnecessary function call.

Link: https://lkml.kernel.org/r/20231108164920.3401565-1-jackmanb@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 23e48832 03-Oct-2023 Hugh Dickins <hughd@google.com>

mm: add page_rmappable_folio() wrapper

folio_prep_large_rmappable() is being used repeatedly along with a
conversion from page to folio, a check non-NULL, a check order > 1: wrap
it all up into struct folio *page_rmappable_folio(struct page *).

Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 76f26535 23-Oct-2023 Hyesoo Yu <hyesoo.yu@samsung.com>

mm: page_alloc: check the order of compound page even when the order is zero

For compound pages, the head sets the PG_head flag and the tail sets the
compound_head to indicate the head page. If a user allocates a compound
page and frees it with a different order, the compound page information
will not be properly initialized. To detect this problem,
compound_order(page) and the order argument are compared, but this is not
checked when the order argument is zero. That error should be checked
regardless of the order.

Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c2baef39 19-Oct-2023 Qi Zheng <zhengqi.arch@bytedance.com>

mm: page_alloc: skip memoryless nodes entirely

Patch series "handle memoryless nodes more appropriately", v3.

Currently, in the process of initialization or offline memory, memoryless
nodes will still be built into the fallback list of itself or other nodes.

This is not what we expected, so this patch series removes memoryless
nodes from the fallback list entirely.


This patch (of 2):

In find_next_best_node(), we skipped the memoryless nodes when building
the zonelists of other normal nodes (N_NORMAL), but did not skip the
memoryless node itself when building the zonelist. This will cause it to
be traversed at runtime.

For example, say we have node0 and node1, node0 is memoryless
node, then the fallback order of node0 and node1 as follows:

[ 0.153005] Fallback order for Node 0: 0 1
[ 0.153564] Fallback order for Node 1: 1

After this patch, we skip memoryless node0 entirely, then
the fallback order of node0 and node1 as follows:

[ 0.155236] Fallback order for Node 0: 1
[ 0.155806] Fallback order for Node 1: 1

So it becomes completely invisible, which will reduce runtime
overhead.

And in this way, we will not try to allocate pages from memoryless node0,
then the panic mentioned in [1] will also be fixed. Even though this
problem has been solved by dropping the NODE_MIN_SIZE constrain in x86
[2], it would be better to fix it in core MM as well.

[1]. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/

[zhengqi.arch@bytedance.com: update comment, per Ingo]
Link: https://lkml.kernel.org/r/7300fc00a057eefeb9a68c8ad28171c3f0ce66ce.1697799303.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/cover.1697799303.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/cover.1697711415.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/157013e978468241de4a4c05d5337a44638ecb0e.1697711415.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6ccdcb6d 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm, pcp: reduce detecting time of consecutive high order page freeing

In current PCP auto-tuning design, if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small, for example,
in the sender of network workloads. If a CPU was used as sender
originally, then it is used as receiver after context switching, we need
to fill the whole PCP with maximal high before triggering PCP draining for
consecutive high order freeing. This will hurt the performance of some
network workloads.

To solve the issue, in this patch, we will track the consecutive page
freeing with a counter in stead of relying on PCP draining. So, we can
detect consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
With the patch, the network bandwidth improves 5.0%. This restores the
performance drop caused by PCP auto-tuning.

Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 57c0419c 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm, pcp: decrease PCP high if free pages < high watermark

One target of PCP is to minimize pages in PCP if the system free pages is
too few. To reach that target, when page reclaiming is active for the
zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating
path, decrease PCP high and free some pages in freeing path. But this may
be too late because the background page reclaiming may introduce latency
for some workloads. So, in this patch, during page allocation we will
detect whether the number of free pages of the zone is below high
watermark. If so, we will stop increasing PCP high in allocating path,
decrease PCP high and free some pages in freeing path. With this, we can
reduce the possibility of the premature background page reclaiming caused
by too large PCP.

The high watermark checking is done in allocating path to reduce the
overhead in hotter freeing path.

Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 51a755c5 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm: tune PCP high automatically

The target to tune PCP high automatically is as follows,

- Minimize allocation/freeing from/to shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system free pages is too few

To reach these target, a tuning algorithm as follows is designed,

- When we refill PCP via allocating from the zone, increase PCP high.
Because if we had larger PCP, we could avoid to allocate from the
zone.

- In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
decrease PCP high to try to free possible idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
high in allocating path, decrease PCP high and free some pages in
freeing path.

So, the PCP high can be tuned to the page allocating/freeing depth of
workloads eventually.

One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small. But this
isn't a severe issue, because there are no idle pages in this case.

One alternative choice is to increase PCP high when we drain PCP via
trying to free pages to the zone, but don't increase PCP high during PCP
refilling. This can avoid the issue above. But if the number of pages
allocated is much less than that of pages freed on a CPU, there will be
many idle pages in PCP and it is hard to free these idle pages.

1/8 (>> 3) of PCP high will be decreased periodically. The value 1/8 is
kind of arbitrary. Just to make sure that the idle PCP pages will be
freed eventually.

On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
build time decreases 3.5%. The cycles% of the spinlock contention (mostly
for zone lock) decreases from 11.0% to 0.5%. The number of PCP draining
for high order pages freeing (free_high) decreases 65.6%. The number of
pages allocated from zone (instead of from PCP) decreases 83.9%.

Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 90b41691 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm: add framework for PCP high auto-tuning

The page allocation performance requirements of different workloads are
usually different. So, we need to tune PCP (per-CPU pageset) high to
optimize the workload page allocation performance. Now, we have a system
wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
But, it's hard to find out the best value by hand. And one global
configuration may not work best for the different workloads that run on
the same system. One solution to these issues is to tune PCP high of each
CPU automatically.

This patch adds the framework for PCP high auto-tuning. With it,
pcp->high of each CPU will be changed automatically by tuning algorithm at
runtime. The minimal high (pcp->high_min) is the original PCP high value
calculated based on the low watermark pages. While the maximal high
(pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction
sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the
maximal pcp->high that can be set via sysctl knob by hand.

It's possible that PCP high auto-tuning doesn't work well for some
workloads. So, when PCP high is tuned by hand via the sysctl knob, the
auto-tuning will be disabled. The PCP high set by hand will be used
instead.

This patch only adds the framework, so pcp->high will be set to
pcp->high_min (original default) always. We will add actual auto-tuning
algorithm in the following patches in the series.

Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c0a24239 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm, page_alloc: scale the number of pages that are batch allocated

When a task is allocating a large number of order-0 pages, it may acquire
the zone->lock multiple times allocating pages in batches. This may
unnecessarily contend on the zone lock when allocating very large number
of pages. This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent allocations.

On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
cycles% of the spinlock contention (mostly for zone lock) decreases from
12.6% to 11.0% (with PCP size == 367).

Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 52166607 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm: restrict the pcp batch scale factor to avoid too long latency

In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention. But
too large batch size will cause too long maximal allocation/freeing
latency, which may punish arbitrary users. So the default batch size is
chosen carefully (in zone_batchsize(), the value is 63 for zone > 1GB) to
avoid that.

In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that are
batch freed"), the batch size will be scaled for large number of page
freeing to improve page freeing performance and reduce zone lock
contention. Similar optimization can be used for large number of pages
allocation too.

To find out a suitable max batch scale factor (that is, max effective
batch size), some tests and measurement on some machines were done as
follows.

A set of debug patches are implemented as follows,

- Set PCP high to be 2 * batch to reduce the effect of PCP high

- Disable free batch size scaling to get the raw performance.

- The code with zone lock held is extracted from rmqueue_bulk() and
free_pcppages_bulk() to 2 separate functions to make it easy to
measure the function run time with ftrace function_graph tracer.

- The batch size is hard coded to be 63 (default), 127, 255, 511,
1023, 2047, 4095.

Then will-it-scale/page_fault1 is used to generate the page
allocation/freeing workload. The page allocation/freeing throughput
(page/s) is measured via will-it-scale. The page allocation/freeing
average latency (alloc/free latency avg, in us) and allocation/freeing
latency at 99 percentile (alloc/free latency 99%, in us) are measured with
ftrace function_graph tracer.

The test results are as follows,

Sapphire Rapids Server
======================
Batch throughput free latency free latency alloc latency alloc latency
page/s avg / us 99% / us avg / us 99% / us
----- ---------- ------------ ------------ ------------- -------------
63 513633.4 2.33 3.57 2.67 6.83
127 517616.7 4.35 6.65 4.22 13.03
255 520822.8 8.29 13.32 7.52 25.24
511 524122.0 15.79 23.42 14.02 49.35
1023 525980.5 30.25 44.19 25.36 94.88
2047 526793.6 59.39 84.50 45.22 140.81

Ice Lake Server
===============
Batch throughput free latency free latency alloc latency alloc latency
page/s avg / us 99% / us avg / us 99% / us
----- ---------- ------------ ------------ ------------- -------------
63 620210.3 2.21 3.68 2.02 4.35
127 627003.0 4.09 6.86 3.51 8.28
255 630777.5 7.70 13.50 6.17 15.97
511 633651.5 14.85 22.62 11.66 31.08
1023 637071.1 28.55 42.02 20.81 54.36
2047 638089.7 56.54 84.06 39.28 91.68

Cascade Lake Server
===================
Batch throughput free latency free latency alloc latency alloc latency
page/s avg / us 99% / us avg / us 99% / us
----- ---------- ------------ ------------ ------------- -------------
63 404706.7 3.29 5.03 3.53 4.75
127 422475.2 6.12 9.09 6.36 8.76
255 411522.2 11.68 16.97 10.90 16.39
511 428124.1 22.54 31.28 19.86 32.25
1023 414718.4 43.39 62.52 40.00 66.33
2047 429848.7 86.64 120.34 71.14 106.08

Commet Lake Desktop
===================
Batch throughput free latency free latency alloc latency alloc latency
page/s avg / us 99% / us avg / us 99% / us
----- ---------- ------------ ------------ ------------- -------------

63 795183.13 2.18 3.55 2.03 3.05
127 803067.85 3.91 6.56 3.85 5.52
255 812771.10 7.35 10.80 7.14 10.20
511 817723.48 14.17 27.54 13.43 30.31
1023 818870.19 27.72 40.10 27.89 46.28

Coffee Lake Desktop
===================
Batch throughput free latency free latency alloc latency alloc latency
page/s avg / us 99% / us avg / us 99% / us
----- ---------- ------------ ------------ ------------- -------------
63 510542.8 3.13 4.40 2.48 3.43
127 514288.6 5.97 7.89 4.65 6.04
255 516889.7 11.86 15.58 8.96 12.55
511 519802.4 23.10 28.81 16.95 26.19
1023 520802.7 45.30 52.51 33.19 45.95
2047 519997.1 90.63 104.00 65.26 81.74

From the above data, to restrict the allocation/freeing latency to be less
than 100 us in most times, the max batch scale factor needs to be less
than or equal to 5.

Although it is reasonable to use 5 as max batch scale factor for the
systems tested, there are also slower systems. Where smaller value should
be used to constrain the page allocation/freeing latency.

So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added to
set the max batch scale factor. Whose default value is 5, and users can
reduce it when necessary.

Link: https://lkml.kernel.org/r/20231016053002.756205-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 362d37a1 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm, pcp: reduce lock contention for draining high-order pages

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
PCP is mostly used for high-order pages freeing to improve the cache-hot
pages reusing between page allocating and freeing CPUs.

On system with small per-CPU data cache slice, pages shouldn't be cached
before draining to guarantee cache-hot. But on a system with large
per-CPU data cache slice, some pages can be cached before draining to
reduce zone lock contention.

So, in this patch, instead of draining without any caching, "pcp->batch"
pages will be cached in PCP before draining if the size of the per-CPU
data cache slice is more than "3 * batch".

In theory, if the size of per-CPU data cache slice is more than "2 *
batch", we can reuse cache-hot pages between CPUs. But considering the
other usage of cache (code, other data accessing, etc.), "3 * batch" is
used.

Note: "3 * batch" is chosen to make sure the optimization works on recent
x86_64 server CPUs. If you want to increase it, please check whether it
breaks the optimization.

On a 2-socket Intel server with 128 logical CPU, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite
with 16-pair processes increase 70.5%. The cycles% of the spinlock
contention (mostly for zone lock) decreases from 46.1% to 21.3%. The
number of PCP draining for high order pages freeing (free_high) decreases
89.9%. The cache miss rate keeps 0.2%.

Link: https://lkml.kernel.org/r/20231016053002.756205-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ca71fe1a 15-Oct-2023 Huang Ying <ying.huang@intel.com>

mm, pcp: avoid to drain PCP when process exit

Patch series "mm: PCP high auto-tuning", v3.

The page allocation performance requirements of different workloads are
often different. So, we need to tune the PCP (Per-CPU Pageset) high on
each CPU automatically to optimize the page allocation performance.

The list of patches in series is as follows,

[1/9] mm, pcp: avoid to drain PCP when process exit
[2/9] cacheinfo: calculate per-CPU data cache size
[3/9] mm, pcp: reduce lock contention for draining high-order pages
[4/9] mm: restrict the pcp batch scale factor to avoid too long latency
[5/9] mm, page_alloc: scale the number of pages that are batch allocated
[6/9] mm: add framework for PCP high auto-tuning
[7/9] mm: tune PCP high automatically
[8/9] mm, pcp: decrease PCP high if free pages < high watermark
[9/9] mm, pcp: reduce detecting time of consecutive high order page freeing

Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
high-order pages freeing.

Patch [4/9], [5/9] optimize batch freeing and allocating.

Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
auto-tuning method.

Patch [9/9] optimize the PCP draining for consecutive high order page
freeing based on PCP high auto-tuning.

The test results for patches with performance impact are as follows,

kbuild
======

On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service.

build time lock contend% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 14.0 100.0 100.0
patch1 99.5 12.8 19.5 95.6
patch3 99.4 12.6 7.1 95.6
patch5 98.6 11.0 8.1 97.1
patch7 95.1 0.5 2.8 15.6
patch9 95.0 1.0 8.8 20.0

The PCP draining optimization (patch [1/9], [3/9]) and PCP batch
allocation optimization (patch [5/9]) reduces zone lock contention a
little. The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
visibly. Where the tuning target: the number of pages allocated from zone
reduces greatly. So, the zone contention cycles% reduces greatly.

With PCP tuning patches (patch [7/9], [9/9]), the average used memory
during test increases up to 18.4% because more pages are cached in PCP.
But at the end of the test, the number of the used memory decreases to the
same level as that of the base patch. That is, the pages cached in PCP
will be released to zone after not being used actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.

score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 2.1 100.0 100.0 1.3
patch1 99.4 2.1 99.4 99.4 1.3
patch3 106.4 1.3 13.3 106.3 1.3
patch5 106.0 1.2 13.2 105.9 1.3
patch7 103.4 1.9 6.7 90.3 7.6
patch9 108.6 1.3 13.7 108.6 1.3

The PCP draining optimization (patch [1/9]+[3/9]) improves performance.
The PCP high auto-tuning (patch [7/9]) reduces performance a little
because PCP draining cannot be triggered in time sometimes. So, the cache
miss rate% increases. The further PCP draining optimization (patch [9/9])
based on PCP tuning restore the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPU, we tested UNIX
(AF_UNIX socket) test case of lmbench3 test suite with 16-pair
processes.

score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 51.4 100.0 100.0 0.2
patch1 116.8 46.1 69.5 104.3 0.2
patch3 199.1 21.3 7.0 104.9 0.2
patch5 200.0 20.8 7.1 106.9 0.3
patch7 191.6 19.9 6.8 103.8 2.8
patch9 193.4 21.7 7.0 104.7 2.1

The PCP draining optimization (patch [1/9], [3/9]) improves performance
much. The PCP tuning (patch [7/9]) reduces performance a little because
PCP draining cannot be triggered in time sometimes. The further PCP
draining optimization (patch [9/9]) based on PCP tuning restores the
performance partly.

The patchset adds several fields in struct per_cpu_pages. The struct
layout before/after the patchset is as follows,

base
====

struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int batch; /* 12 4 */
short int free_factor; /* 16 2 */
short int expire; /* 18 2 */

/* XXX 4 bytes hole, try to pack */

struct list_head lists[13]; /* 24 208 */

/* size: 256, cachelines: 4, members: 7 */
/* sum members: 228, holes: 1, sum holes: 4 */
/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int high_min; /* 12 4 */
int high_max; /* 16 4 */
int batch; /* 20 4 */
u8 flags; /* 24 1 */
u8 alloc_factor; /* 25 1 */
u8 expire; /* 26 1 */

/* XXX 1 byte hole, try to pack */

short int free_count; /* 28 2 */

/* XXX 2 bytes hole, try to pack */

struct list_head lists[13]; /* 32 208 */

/* size: 256, cachelines: 4, members: 11 */
/* sum members: 237, holes: 2, sum holes: 3 */
/* padding: 16 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't changed with the patchset.


This patch (of 9):

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
PCP is mostly used for high-order pages freeing to improve the cache-hot
pages reusing between page allocation and freeing CPUs.

But, the PCP draining mechanism may be triggered unexpectedly when process
exits. With some customized trace point, it was found that PCP draining
(free_high == true) was triggered with the order-1 page freeing with the
following call stack,

=> free_unref_page_commit
=> free_unref_page
=> __mmdrop
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64

Checking the source code, this is the page table PGD freeing
(mm_free_pgd()). It's a order-1 page freeing if
CONFIG_PAGE_TABLE_ISOLATION=y. Which is a common configuration for
security.

Just before that, page freeing with the following call stack was found,

=> free_unref_page_commit
=> free_unref_page_list
=> release_pages
=> tlb_batch_pages_flush
=> tlb_finish_mmu
=> exit_mmap
=> __mmput
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64

So, when a process exits,

- a large number of user pages of the process will be freed without
page allocation, it's highly possible that pcp->free_factor becomes >
0. In fact, this is expected behavior to improve process exit
performance.

- after freeing all user pages, the PGD will be freed, which is a
order-1 page freeing, PCP will be drained.

All in all, when a process exits, it's high possible that the PCP will be
drained. This is an unexpected behavior.

To avoid this, in the patch, the PCP draining will only be triggered for 2
consecutive high-order page freeing.

On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
cycles% of the spinlock contention (mostly for zone lock) decreases from
14.0% to 12.8% (with PCP size == 367). The number of PCP draining for
high order pages freeing (free_high) decreases 80.5%.

This helps network workload too for reduced zone lock contention. On a
2-socket Intel server with 128 logical CPU, with the patch, the network
bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite with
16-pair processes increase 16.8%. The cycles% of the spinlock contention
(mostly for zone lock) decreases from 51.4% to 46.1%. The number of PCP
draining for high order pages freeing (free_high) decreases 30.5%. The
cache miss rate keeps 0.2%.

Link: https://lkml.kernel.org/r/20231016053002.756205-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20231016053002.756205-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0dfca313 27-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: remove unnecessary next_page in break_down_buddy_pages

The next_page is only used to forward page in case target is in second
half range. Move forward page directly to remove unnecessary next_page.

Link: https://lkml.kernel.org/r/20230927103514.98281-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 27e0db3c 27-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: remove unnecessary check in break_down_buddy_pages

Patch series "Two minor cleanups to break_down_buddy_pages", v2.

Two minor cleanups to break_down_buddy_pages.


This patch (of 2):

1. We always have target in range started with next_page and full free
range started with current_buddy.

2. The last split range size is 1 << low and low should be >= 0, then
size >= 1. So page + size != page is always true (because size > 0).
As summary, current_page will not equal to target page.

Link: https://lkml.kernel.org/r/20230927103514.98281-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20230927103514.98281-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 61e21cf2 27-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: correct start page when guard page debug is enabled

When guard page debug is enabled and set_page_guard returns success, we
miss to forward page to point to start of next split range and we will do
split unexpectedly in page range without target page. Move start page
update before set_page_guard to fix this.

As we split to wrong target page, then splited pages are not able to merge
back to original order when target page is put back and splited pages
except target page is not usable. To be specific:

Consider target page is the third page in buddy page with order 2.
| buddy-2 | Page | Target | Page |

After break down to target page, we will only set first page to Guard
because of bug.
| Guard | Page | Target | Page |

When we try put_page_back_buddy with target page, the buddy page of target
if neither guard nor buddy, Then it's not able to construct original page
with order 2
| Guard | Page | buddy-0 | Page |

All pages except target page is not in free list and is not usable.

Link: https://lkml.kernel.org/r/20230927094401.68205-1-shikemeng@huaweicloud.com
Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7b086755 11-Sep-2023 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list

Commit 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
bypasses the pcplist on lock contention and returns the page directly to
the buddy list of the page's migratetype.

For pages that don't have their own pcplist, such as CMA and HIGHATOMIC,
the migratetype is temporarily updated such that the page can hitch a ride
on the MOVABLE pcplist. Their true type is later reassessed when flushing
in free_pcppages_bulk(). However, when lock contention is detected after
the type was already overridden, the bypass will then put the page on the
wrong buddy list.

Once on the MOVABLE buddy list, the page becomes eligible for fallbacks
and even stealing. In the case of HIGHATOMIC, otherwise ineligible
allocations can dip into the highatomic reserves. In the case of CMA, the
page can be lost from the CMA region permanently.

Use a separate pcpmigratetype variable for the pcplist override. Use the
original migratetype when going directly to the buddy. This fixes the bug
and should make the intentions more obvious in the code.

Originally sent here to address the HIGHATOMIC case:
https://lore.kernel.org/lkml/20230821183733.106619-4-hannes@cmpxchg.org/

Changelog updated in response to the CMA-specific bug report.

[mgorman@techsingularity.net: updated changelog]
Link: https://lkml.kernel.org/r/20230911181108.GA104295@cmpxchg.org
Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Joe Liu <joe.liu@mediatek.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f945116e 24-Aug-2023 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: remove stale CMA guard code

In the past, movable allocations could be disallowed from CMA through
PF_MEMALLOC_PIN. As CMA pages are funneled through the MOVABLE pcplist,
this required filtering that cornercase during allocations, such that
pinnable allocations wouldn't accidentally get a CMA page.

However, since 8e3560d963d2 ("mm: honor PF_MEMALLOC_PIN for all movable
pages"), PF_MEMALLOC_PIN automatically excludes __GFP_MOVABLE. Once
again, MOVABLE implies CMA is allowed.

Remove the stale filtering code. Also remove a stale comment that was
introduced as part of the filtering code, because the filtering let
order-0 pages fall through to the buddy allocator. See 1d91df85f399
("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore}
APIs") for context. The comment's been obsolete since the introduction of
the explicit ALLOC_HIGHATOMIC flag in eb2e2b425c69 ("mm/page_alloc:
explicitly record high-order atomic allocations in alloc_flags").

Link: https://lkml.kernel.org/r/20230824153821.243148-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# de53c05f 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: add large_rmappable page flag

Stored in the first tail page's flags, this flag replaces the destructor.
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.

Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9c5ccf2d 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove HUGETLB_PAGE_DTOR

We can use a bit in page[1].flags to indicate that this folio belongs to
hugetlb instead of using a value in page[1].dtors. That lets
folio_test_hugetlb() become an inline function like it should be. We can
also get rid of NULL_COMPOUND_DTOR.

Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0f2f43fa 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove free_compound_page() and the compound_page_dtors array

The only remaining destructor is free_compound_page(). Inline it into
destroy_large_folio() and remove the array it used to live in.

Link: https://lkml.kernel.org/r/20230816151201.3655946-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# da6e7bf3 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert prep_transhuge_page() to folio_prep_large_rmappable()

Match folio_undo_large_rmappable(), and move the casting from page to
folio into the callers (which they were largely doing anyway).

Link: https://lkml.kernel.org/r/20230816151201.3655946-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8dc4a8f1 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert free_transhuge_folio() to folio_undo_large_rmappable()

Indirect calls are expensive, thanks to Spectre. Test for
TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the
free_compound_page() call into destroy_large_folio() to simplify later
patches.

Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 454a00c4 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert free_huge_page() to free_huge_folio()

Pass a folio instead of the head page to save a few instructions. Update
the documentation, at least in English.

Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dd6fa0b6 16-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: call free_huge_page() directly

Indirect calls are expensive, thanks to Spectre. Call free_huge_page()
directly if the folio belongs to hugetlb.

Link: https://lkml.kernel.org/r/20230816151201.3655946-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b5ffd297 11-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: use get_pfnblock_migratetype to avoid extra page_to_pfn

We have get_pageblock_migratetype and get_pfnblock_migratetype to get
migratetype of page. get_pfnblock_migratetype accepts both page and pfn
from caller while get_pageblock_migratetype only accept page and get pfn
with page_to_pfn from page.

In case we already record pfn of page, we can simply call
get_pfnblock_migratetype to avoid a page_to_pfn.

Link: https://lkml.kernel.org/r/20230811115945.3423894-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a04d12c2 11-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: remove unnecessary inner __get_pfnblock_flags_mask

Patch series "Two minor cleanups for get pageblock migratetype".

This series contains two minor cleanups for get pageblock migratetype.
More details can be found in respective patches.


This patch (of 2):

get_pfnblock_flags_mask() just calls inline inner
__get_pfnblock_flags_mask without any extra work. Just opencode
__get_pfnblock_flags_mask in get_pfnblock_flags_mask and replace call to
__get_pfnblock_flags_mask with call to get_pfnblock_flags_mask to remove
unnecessary __get_pfnblock_flags_mask.

Link: https://lkml.kernel.org/r/20230811115945.3423894-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20230811115945.3423894-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 368d983b 09-Aug-2023 ZhangPeng <zhangpeng362@huawei.com>

mm: page_alloc: remove unused parameter from reserve_highatomic_pageblock()

Just remove the redundant parameter alloc_order from
reserve_highatomic_pageblock(). No functional modification involved.

Link: https://lkml.kernel.org/r/20230809073323.1065286-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 13058705 09-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: remove unnecessary parameter batch of nr_pcp_free

We get batch from pcp and just pass it to nr_pcp_free immediately. Get
batch from pcp inside nr_pcp_free to remove unnecessary parameter batch.

Link: https://lkml.kernel.org/r/20230809100754.3094517-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f142b2c2 09-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

mm/page_alloc: remove track of active PCP lists range in bulk free

Patch series "Two minor cleanups for pcp list in page_alloc".

There are two minor cleanups for pcp list in page_alloc. More details
can be found in respective patches.


This patch (of 2):

After commit fd56eef258a17 ("mm/page_alloc: simplify how many pages are
selected per pcp list during bulk free"), we will drain all pages in
selected pcp list. And we ensured passed count is < pcp->count. Then,
the search will finish before wrap-around and track of active PCP lists
range intended for wrap-around case is no longer needed.

Link: https://lkml.kernel.org/r/20230809100754.3094517-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20230809100754.3094517-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c1dc69e6 03-Aug-2023 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: remove unneeded variable base

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations"), local variable base is just as same as order. So
remove it. No functional change intended.

Link: https://lkml.kernel.org/r/20230803114934.693989-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ebddd111 01-Aug-2023 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: avoid unneeded alike_pages calculation

When free_pages is 0, alike_pages is not used. So alike_pages calculation
can be avoided by checking free_pages early to save cpu cycles. Also fix
typo 'comparable'. It should be 'compatible' here.

Link: https://lkml.kernel.org/r/20230801123723.2225543-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 82d9b8c8 04-Jul-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: page_alloc: avoid false page outside zone error info

If pfn is outside zone boundaries in the first round, ret will be set to
1. But if pfn is changed to inside the zone boundaries in zone span
seqretry path, ret is still set to 1 leading to false page outside zone
error info.

This is from code inspection. The race window should be really small thus
hard to trigger in real world.

[akpm@linux-foundation.org: code simplification, per Matthew]
Link: https://lkml.kernel.org/r/20230704111823.940331-1-linmiaohe@huawei.com
Fixes: bdc8cb984576 ("[PATCH] memory hotplug locking: zone span seqlock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a2ebb515 23-Jun-2023 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm/page_alloc: use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on PREEMPT_RT.
The problem is that the inner spinlock_t becomes a sleeping lock on
PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in one
step. printk_deferred_enter() has to be invoked in non-migrate-able
context to ensure that deferred printing is enabled and disabled on the
same CPU. This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
local_irq_disable();
printk_deferred_enter();
write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the lockdep
annotation happens after the sequence count is incremented. This is
exceptional and subject to change.

It was also pointed that PREEMPT_RT can be affected by the printk problem
since its write_seqlock_irqsave() does not really disable interrupts.
This isn't the case because PREEMPT_RT's printk implementation differs
from the mainline implementation in two important aspects:

- Printing happens in a dedicated threads and not at during the
invocation of printk().
- In emergency cases where synchronous printing is used, a different
driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Link: https://lkml.kernel.org/r/20230623201517.yw286Knb@linutronix.de
Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 416ef04f 24-Jun-2023 liuq <liuq131@chinatelecom.cn>

mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE

The current calculation of min_free_kbytes only uses ZONE_DMA and
ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will
also divide part of min_free_kbytes.This will cause the min watermark of
ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE.

__GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone
pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in
__setup_per_zone_wmarks().

On my testing machine with 16GB of memory (transparent hugepage is turned
off by default, and movablecore=12G is configured) The following is a
comparative test data of watermark_min

no patch add patch
ZONE_DMA 1 8
ZONE_DMA32 151 709
ZONE_NORMAL 233 1113
ZONE_MOVABLE 1434 128
min_free_kbytes 7288 7326

Link: https://lkml.kernel.org/r/20230625031656.23941-1-liuq131@chinatelecom.cn
Signed-off-by: liuq <liuq131@chinatelecom.cn>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1bf61092 21-Jun-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm: page_alloc: use the correct type of list for free pages

Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and
page->pcp_list") introduces page->buddy_list and page->pcp_list as a union
with page->lru, but missed to change get_page_from_free_area() to use
page->buddy_list to clarify the correct type of list for a free page.

Link: https://lkml.kernel.org/r/7e7ab533247d40c0ea0373c18a6a48e5667f9e10.1687333557.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# cf01724e 16-Jun-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: page_alloc: make compound_page_dtors static

It's only used inside page_alloc.c now. So make it static and remove the
declaration in mm.h.

Link: https://lkml.kernel.org/r/20230617034622.1235913-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 12dd992a 03-Jun-2023 Miaohe Lin <linmiaohe@huawei.com>

mm: page_alloc: remove unneeded header files

Remove some unneeded header files. No functional change intended.

Link: https://lkml.kernel.org/r/20230603112558.213694-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3b11edf1 13-May-2023 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm/page_alloc: don't wake kswapd from rmqueue() unless __GFP_KSWAPD_RECLAIM is specified

Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
using ZONE_BOOSTED_WATERMARK flag.

Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
__GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
risk of deadlock. But since zone->flags is a shared variable, a thread
doing !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
being set immediately after another thread doing __GFP_KSWAPD_RECLAIM
allocation request set this flag, causing possibility of deadlock.

Link: https://lkml.kernel.org/r/c3c3dacf-dd3b-77c9-f96a-d0982b4b2a4f@I-love.SAKURA.ne.jp
Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 511a69b2 19-May-2023 Johannes Weiner <hannes@cmpxchg.org>

mm: compaction: simplify should_compact_retry()

The different branches for retry are unnecessarily complicated. There are
really only three outcomes: progress (retry n times), skipped (retry if
reclaim can help), failed (retry with higher priority).

Rearrange the branches and the retry counter to make it simpler.

[hannes@cmpxchg.org: restore behavior when hitting max_retries]
Link: https://lkml.kernel.org/r/20230602144705.GB161817@cmpxchg.org
Link: https://lkml.kernel.org/r/20230519123959.77335-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ecd8b292 19-May-2023 Johannes Weiner <hannes@cmpxchg.org>

mm: compaction: remove compaction result helpers

Patch series "mm: compaction: cleanups & simplifications".

These compaction cleanups are split out from the huge page allocator
series[1], as requested by reviewer feedback.

[1] https://lore.kernel.org/linux-mm/20230418191313.268131-1-hannes@cmpxchg.org/


This patch (of 5):

The compaction result helpers encode quirks that are specific to the
allocator's retry logic. E.g. COMPACT_SUCCESS and COMPACT_COMPLETE
actually represent failures that should be retried upon, and so on. I
frequently found myself pulling up the helper implementation in order to
understand and work on the retry logic. They're not quite clean
abstractions; rather they split the retry logic into two locations.

Remove the helpers and inline the checks. Then comment on the result
interpretations directly where the decision making happens.

Link: https://lkml.kernel.org/r/20230519123959.77335-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20230519123959.77335-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 62069aac 18-May-2023 Tom Rix <trix@redhat.com>

mm: page_alloc: set sysctl_lowmem_reserve_ratio storage-class-specifier to static

smatch reports
mm/page_alloc.c:247:5: warning: symbol
'sysctl_lowmem_reserve_ratio' was not declared. Should it be static?

This variable is only used in its defining file, so it should be static

Link: https://lkml.kernel.org/r/20230518141119.927074-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ecbb490d 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move is_check_pages_enabled() into page_alloc.c

The is_check_pages_enabled() only used in page_alloc.c, move it into
page_alloc.c, also use it in free_tail_page_prepare().

Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e95d372c 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move sysctls into it own fils

This moves all page alloc related sysctls to its own file, as part of the
kernel/sysctl.c spring cleaning, also move some functions declarations
from mm.h into internal.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 07f44ac3 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move pm_* function into power

pm_restrict_gfp_mask()/pm_restore_gfp_mask() only used in power, let's
move them out of page_alloc.c.

Adding a general gfp_has_io_fs() function which return true if gfp with
both __GFP_IO and __GFP_FS flags, then use it inside of
pm_suspended_storage(), also the pm_suspended_storage() is moved into
suspend.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-11-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 31a1b9d7 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move mark_free_page() into snapshot.c

The mark_free_page() is only used in kernel/power/snapshot.c, move it out
to reduce a bit of page_alloc.c

Link: https://lkml.kernel.org/r/20230516063821.121844-10-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 884c175f 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: split out DEBUG_PAGEALLOC

Move DEBUG_PAGEALLOC related functions into a single file to reduce a bit
of page_alloc.c.

Link: https://lkml.kernel.org/r/20230516063821.121844-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0866e82e 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: split out FAIL_PAGE_ALLOC

... to a single file to reduce a bit of page_alloc.c.

Link: https://lkml.kernel.org/r/20230516063821.121844-8-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e9f2b529 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: remove alloc_contig_dump_pages() stub

DEFINE_DYNAMIC_DEBUG_METADATA and DYNAMIC_DEBUG_BRANCH already has stub
definitions without dynamic debug feature, remove unnecessary
alloc_contig_dump_pages() stub.

Link: https://lkml.kernel.org/r/20230516063821.121844-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5b855aa3 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: squash page_is_consistent()

Squash the page_is_consistent() into bad_range() as there is only one
caller.

Link: https://lkml.kernel.org/r/20230516063821.121844-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e9aae170 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: collect mem statistic into show_mem.c

Let's move show_mem.c from lib to mm, as it belongs memory subsystem, also
split some memory statistic related functions from page_alloc.c to
show_mem.c, and we cleanup some unneeded include.

There is no functional change.

Link: https://lkml.kernel.org/r/20230516063821.121844-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 904d5857 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move set_zone_contiguous() into mm_init.c

set_zone_contiguous() is only used in mm init/hotplug, and
clear_zone_contiguous() only used in hotplug, move them from page_alloc.c
to the more appropriate file.

Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5e7d5da2 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move init_on_alloc/free() into mm_init.c

Since commit f2fc4b44ec2b ("mm: move init_mem_debugging_and_hardening() to
mm/mm_init.c"), the init_on_alloc() and init_on_free() define is better to
move there too.

Link: https://lkml.kernel.org/r/20230516063821.121844-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 072ba380 16-May-2023 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: page_alloc: move mirrored_kernelcore into mm_init.c

Patch series "mm: page_alloc: misc cleanup and refactor", v2.

This aims to reduce more space in page_alloc.c, also do some cleanup, no
functional changes intended.


This patch (of 13):

Since commit 9420f89db2dd ("mm: move most of core MM initialization to
mm/mm_init.c"), mirrored_kernelcore should be moved into mm_init.c, as
most related codes are already there.

Link: https://lkml.kernel.org/r/20230516063821.121844-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20230516063821.121844-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3c4322c9 24-Apr-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm/page_alloc: drop the unnecessary pfn_valid() for start pfn

__pageblock_pfn_to_page() currently performs both pfn_valid check and
pfn_to_online_page(). The former one is redundant because the latter is a
stronger check. Drop pfn_valid().

Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dcdfdd40 06-Jun-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: Add support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

1. Accept all the memory during boot. It is easy to implement and it
doesn't have runtime cost once the system is booted. The downside is
very long boot time.

Accept can be parallelized to multiple CPUs to keep it manageable
(i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond the point.

2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.

On-demand memory accept means latency spikes every time kernel steps
onto a new memory block. The spikes will go away once workload data
set size gets stabilized or all memory gets accepted.

3. Accept all memory in background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize time the
system experience latency spikes on memory allocation while keeping
low boot time.

This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.

The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to pre-accepted pool of memory.

- page allocator accepts memory on the first allocation of the page.
When kernel runs out of accepted memory, it accepts memory until the
high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

- accept_memory() makes a range of physical addresses accepted.

- range_contains_unaccepted_memory() checks anything within the range
of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com


# 65f67a3e 25-Apr-2023 Baolin Wang <baolin.wang@linux.alibaba.com>

mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()

Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(), which
checks whether the given zone contains holes, and uses
pfn_to_online_page() to validate if the start pfn is online and valid, as
well as using pfn_valid() to validate the end pfn.

However, the __pageblock_pfn_to_page() function may return non-NULL even
if the end pfn of a pageblock is in a memory hole in some situations. For
example, if the pageblock order is MAX_ORDER, which will fall into 2
sub-sections, and the end pfn of the pageblock may be hole even though the
start pfn is online and valid.

See below memory layout as an example and suppose the pageblock order is
MAX_ORDER.

[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
[ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]

Focus on the last memory range, and there is a hole for the range [mem
0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock
will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the
pageblock must be 4M aligned. And in this pageblock, these pfns will fall
into 2 sub-section (the sub-section size is 2M aligned).

So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 - 0x1fa7dfffff
) in this pageblock is valid by calling subsection_map_init() in
free_area_init(), but the 2nd sub-section (indicates pfn range:
0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock is not valid.

This did not break anything until now, but the zone continuous is fragile
in this possible scenario. So as previous discussion[1], it is better to
add some comments to explain this possible issue in case there are some
future pfn walkers that rely on this.

[1] https://lore.kernel.org/all/87r0sdsmr6.fsf@yhuang6-desk2.ccr.corp.intel.com/

Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8e464522 28-Feb-2023 Haifeng Xu <haifeng.xu@shopee.com>

cpuset: Clean up cpuset_node_allowed

Commit 002f290627c2 ("cpuset: use static key better and convert to new API")
has used __cpuset_node_allowed() instead of cpuset_node_allowed() to check
whether we can allocate on a memory node. Now this function isn't used by
anyone, so we can do the follow things to clean up it.

1. remove unused codes
2. rename __cpuset_node_allowed() to cpuset_node_allowed()
3. update comments in mm/page_alloc.c

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 8666925c 05-Apr-2023 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: use check_pages_enabled static key to check tail pages

Commit 700d2e9a36b9 ("mm, page_alloc: reduce page alloc/free sanity
checks") has introduced a new static key check_pages_enabled to control
when struct pages are sanity checked during allocation and freeing. Mel
Gorman suggested that free_tail_pages_check() could use this static key as
well, instead of relying on CONFIG_DEBUG_VM. That makes sense, so do
that. Also rename the function to free_tail_page_prepare() because it
works on a single tail page and has a struct page preparation component as
well as the optional checking component.

Also remove some unnecessary unlikely() within static_branch_unlikely()
statements that Mel pointed out for commit 700d2e9a36b9.

Link: https://lkml.kernel.org/r/20230405142840.11068-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# eb8589b4 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm: move mem_init_print_info() to mm_init.c

mem_init_print_info() is only called from mm_core_init().

Move it close to the caller and make it static.

Link: https://lkml.kernel.org/r/20230321170513.2401534-12-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f2fc4b44 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm: move init_mem_debugging_and_hardening() to mm/mm_init.c

init_mem_debugging_and_hardening() is only called from mm_core_init().

Move it close to the caller, make it static and rename it to
mem_debugging_and_hardening_init() for consistency with surrounding
convention.

Link: https://lkml.kernel.org/r/20230321170513.2401534-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c4fbed4b 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm/page_alloc: rename page_alloc_init() to page_alloc_init_cpuhp()

The page_alloc_init() name is really misleading because all this function
does is sets up CPU hotplug callbacks for the page allocator.

Rename it to page_alloc_init_cpuhp() so that name will reflect what the
function does.

Link: https://lkml.kernel.org/r/20230321170513.2401534-6-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 534ef4e1 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm: handle hashdist initialization in mm/mm_init.c

The hashdist variable must be initialized before the first call to
alloc_large_system_hash() and free_area_init() looks like a better place
for it than page_alloc_init().

Move hashdist handling to mm/mm_init.c

Link: https://lkml.kernel.org/r/20230321170513.2401534-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9420f89d 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm: move most of core MM initialization to mm/mm_init.c

The bulk of memory management initialization code is spread all over
mm/page_alloc.c and makes navigating through page allocator functionality
difficult.

Move most of the functions marked __init and __meminit to mm/mm_init.c to
make it better localized and allow some more spare room before
mm/page_alloc.c reaches 10k lines.

No functional changes.

Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fce0b421 21-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm/page_alloc: add helper for checking if check_pages_enabled

Instead of duplicating long static_branch_enabled(&check_pages_enabled)
wrap it in a helper function is_check_pages_enabled()

Link: https://lkml.kernel.org/r/20230321170513.2401534-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5d671eb4 19-Mar-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm: move get_page_from_free_area() to mm/page_alloc.c

The get_page_from_free_area() helper is only used in mm/page_alloc.c so
move it there to reduce noise in include/linux/mmzone.h

Link: https://lkml.kernel.org/r/20230319114214.2133332-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 23baf831 15-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, treewide: redefine MAX_ORDER sanely

MAX_ORDER currently defined as number of orders page allocator supports:
user can ask buddy allocator for page order between 0 and MAX_ORDER-1.

This definition is counter-intuitive and lead to number of bugs all over
the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders
user can ask from buddy allocator is 0..MAX_ORDER now.

[kirill@shutemov.name: fix min() warning]
Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dcc1be11 12-Mar-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm: prefer xxx_page() alloc/free functions for order-0 pages

Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and
__free_pages(..., 0) to use alloc_page(), __get_free_page() and
__free_page() respectively in core code.

Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0a54864f 09-Mar-2023 Peter Collingbourne <pcc@google.com>

kasan: remove PG_skip_kasan_poison flag

Code inspection reveals that PG_skip_kasan_poison is redundant with
kasantag, because the former is intended to be set iff the latter is the
match-all tag. It can also be observed that it's basically pointless to
poison pages which have kasantag=0, because any pages with this tag would
have been pointed to by pointers with match-all tags, so poisoning the
pages would have little to no effect in terms of bug detection.
Therefore, change the condition in should_skip_kasan_poison() to check
kasantag instead, and remove PG_skip_kasan_poison and associated flags.

Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com
Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 700d2e9a 16-Feb-2023 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: reduce page alloc/free sanity checks

Historically, we have performed sanity checks on all struct pages being
allocated or freed, making sure they have no unexpected page flags or
certain field values. This can detect insufficient cleanup and some cases
of use-after-free, although on its own it can't always identify the
culprit. The result is a warning and the "bad page" being leaked.

The checks do need some cpu cycles, so in 4.7 with commits 479f854a207c
("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
and 4db7548ccbd9 ("mm, page_alloc: defer debugging checks of freed pages
until a PCP drain") they were no longer performed in the hot paths when
allocating and freeing from pcplists, but only when pcplists are bypassed,
refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled
the checks were instead still done in the hot paths and not when refilling
or draining pcplists.

With 4462b32c9285 ("mm, page_alloc: more extensive free page checking with
debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks
back to hot pahs. When both debug_pagealloc and CONFIG_DEBUG_VM are
enabled, the checks are done both in hotpaths and pcplist refill/drain.

Even though the non-debug default today might seem to be a sensible
tradeoff between overhead and ability to detect bad pages, on closer look
it's arguably not. As most allocations go through the pcplists, catching
any bad pages when refilling or draining pcplists has only a small chance,
insufficient for debugging or serious hardening purposes. On the other
hand the cost of the checks is concentrated in the already expensive
drain/refill batching operations, and those are done under the often
contended zone lock. That was recently identified as an issue for page
allocation and the zone lock contention reduced by moving the checks
outside of the locked section with a patch "mm: reduce lock contention of
pcp buffer refill", but the cost of the checks is still visible compared
to their removal [1]. In the pcplist draining path free_pcppages_bulk()
the checks are still done under zone->lock.

Thus, remove the checks from pcplist refill and drain paths completely.
Introduce a static key check_pages_enabled to control checks during page
allocation a freeing (whether pcplist is used or bypassed). The static
key is enabled if either is true:

- kernel is built with CONFIG_DEBUG_VM=y (debugging)
- debug_pagealloc or page poisoning is boot-time enabled (debugging)
- init_on_alloc or init_on_free is boot-time enabled (hardening)

The resulting user visible changes:
- no checks when draining/refilling pcplists - less overhead, with
likely no practical reduction of ability to catch bad pages
- no checks when bypassing pcplists in default config (no
debugging/hardening) - less overhead etc. as above
- on typical hardened kernels [2], checks are now performed on each page
allocation/free (previously only when bypassing/draining/refilling
pcplists) - the init_on_alloc/init_on_free enabled should be sufficient
indication for preferring more costly alloc/free operations for
hardening purposes and we shouldn't need to introduce another toggle
- code (various wrappers) removal and simplification

[1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
[2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/

[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: make check_pages_enabled static]
Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz
Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2ede3c13 01-Feb-2023 Alexander Halbuer <halbuer@sra.uni-hannover.de>

mm: reduce lock contention of pcp buffer refill

rmqueue_bulk() batches the allocation of multiple elements to refill the
per-CPU buffers into a single hold of the zone lock. Each element is
allocated and checked using check_pcp_refill(). The check touches every
related struct page which is especially expensive for higher order
allocations (huge pages).

This patch reduces the time holding the lock by moving the check out of
the critical section similar to rmqueue_buddy() which allocates a single
element.

Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency of 50 percent for two cores and
nearly 90 percent for 24 cores.

Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.de
Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4d73ba5f 14-Apr-2023 Mel Gorman <mgorman@techsingularity.net>

mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages

A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.

Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.

Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.

The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.

hpagealloc
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla hugeallocrevert-v1r1 hugeallocsimple-v1r2
Min Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
1st-qrtle Latency 356.61 ( 0.00%) 5.34 ( 98.50%) 19.85 ( 94.43%)
2nd-qrtle Latency 697.26 ( 0.00%) 5.47 ( 99.22%) 20.44 ( 97.07%)
3rd-qrtle Latency 972.94 ( 0.00%) 5.50 ( 99.43%) 20.81 ( 97.86%)
Max-1 Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
Max-5 Latency 82.14 ( 0.00%) 5.11 ( 93.78%) 19.31 ( 76.49%)
Max-10 Latency 150.54 ( 0.00%) 5.20 ( 96.55%) 19.43 ( 87.09%)
Max-90 Latency 1164.45 ( 0.00%) 5.53 ( 99.52%) 20.97 ( 98.20%)
Max-95 Latency 1223.06 ( 0.00%) 5.55 ( 99.55%) 21.06 ( 98.28%)
Max-99 Latency 1278.67 ( 0.00%) 5.57 ( 99.56%) 22.56 ( 98.24%)
Max Latency 1310.90 ( 0.00%) 8.06 ( 99.39%) 26.62 ( 97.97%)
Amean Latency 678.36 ( 0.00%) 5.44 * 99.20%* 20.44 * 96.99%*

6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla revert-v1 hugeallocfix-v2
Duration User 0.28 0.27 0.30
Duration System 808.66 17.77 35.99
Duration Elapsed 830.87 18.08 36.33

The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1007843a 04-Apr-2023 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}

This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.

CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}

This deadlock scenario can be eliminated by

preventing interrupt context from calling kmalloc(GFP_ATOMIC)

and

preventing printk() from calling console_flush_all()

while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by

disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

and

disabling synchronous printk() in order to avoid console_flush_all()

.

As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f446883d 09-Mar-2023 Peter Collingbourne <pcc@google.com>

Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"

This reverts commit 487a32ec24be819e747af8c2ab0d5c515508086a.

should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from
page->flags. However, this line of code in free_pages_prepare():

page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

clears most of page->flags, including PG_skip_kasan_poison, before calling
should_skip_kasan_poison(), which meant that it would never return true as
a result of the page flag being set. Therefore, fix the code to call
should_skip_kasan_poison() before clearing the flags, as we were doing
before the reverted patch.

This fixes a measurable performance regression introduced in the reverted
commit, where munmap() takes longer than intended if HW tags KASAN is
supported and enabled at runtime. Without this patch, we see a
single-digit percentage performance regression in a particular
mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch,
there is no statistically significant performance impact when enabling HW
tags KASAN.

Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com
Fixes: 487a32ec24be ("kasan: drop skip_kasan_poison variable in free_pages_prepare")
Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f7a449f7 13-Feb-2023 Roman Gushchin <roman.gushchin@linux.dev>

mm: memcontrol: rename memcg_kmem_enabled()

Currently there are two kmem-related helper functions with a confusing
semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled().

The problem is that an obvious expectation
memcg_kmem_enabled() == !mem_cgroup_kmem_disabled(),
can be false.

mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns
true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is
disabled using a boot time kernel option "cgroup.memory=nokmem". It never
changes the value dynamically.

memcg_kmem_enabled() is different: it always returns false until the first
non-root memory cgroup will get online (assuming the kernel memory
accounting is enabled). It's goal is to improve the performance on
systems without the cgroupfs mounted/memory controller enabled or on the
systems with only the root memory cgroup.

To make things more obvious and avoid potential bugs, let's rename
memcg_kmem_enabled() to memcg_kmem_online().

Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1bc67ca6 12-Feb-2023 Qi Zheng <zhengqi.arch@bytedance.com>

mm: page_alloc: call panic() when memoryless node allocation fails

In free_area_init(), we will continue to run after allocation of
memoryless node pgdat fails. However, in the subsequent process (such as
when initializing zonelist), the case that NODE_DATA(nid) is NULL is not
handled, which will cause panic. Instead of this, it's better to call
panic() directly when the memory allocation fails during system boot.

Link: https://lkml.kernel.org/r/20230212111027.95520-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# aa02d3c1 03-Feb-2023 Yajun Deng <yajun.deng@linux.dev>

mm/page_alloc: reduce fallbacks to (MIGRATE_PCPTYPES - 1)

The commit 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable
pageblocks with others") has removed MIGRATE_CMA and MIGRATE_ISOLATE from
fallbacks list. so there is no need to add an element at the end of every
type.

Reduce fallbacks to (MIGRATE_PCPTYPES - 1).

Link: https://lkml.kernel.org/r/20230203100132.1627787-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 420ef683 24-Jan-2023 Andrey Konovalov <andreyknvl@gmail.com>

kasan: reset page tags properly with sampling

The implementation of page_alloc poisoning sampling assumed that
tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations.
However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan:
Skip unpoisoning of user pages").

This leads to kernel crashes when MTE-enabled userspace mappings are used
with Hardware Tag-Based KASAN enabled.

Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().

Also clarify and fix related comments.

[andreyknvl@google.com: update comment]
Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com
Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Peter Collingbourne <pcc@google.com>
Tested-by: Peter Collingbourne <pcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 076cf7ea 05-Jan-2023 Anshuman Khandual <anshuman.khandual@arm.com>

mm/page_alloc: use deferred_pages_enabled() wherever applicable

Instead of directly accessing static deferred_pages, replace such
instances with the helper deferred_pages_enabled(). No functional change
is intended.

Link: https://lkml.kernel.org/r/20230105082506.241529-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 7ec7096b 17-Jan-2023 Pasha Tatashin <pasha.tatashin@soleen.com>

mm/page_ext: init page_ext early if there are no deferred struct pages

page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.

Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.

[pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n]
Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2973d822 13-Jan-2023 NeilBrown <neilb@suse.de>

mm: discard __GFP_ATOMIC

__GFP_ATOMIC serves little purpose. Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.

It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark. It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.

__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this. We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.

__GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is
probable that testing ALLOC_HARDER is a better fit here.

__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep. This should test __GFP_DIRECT_RECLAIM instead.

This patch:
- removes __GFP_ATOMIC
- allows __GFP_HIGH allocations to ignore watermark boosting as well
as GFP_ATOMIC requests.
- makes other adjustments as suggested by the above.

The net result is not change to GFP_ATOMIC allocations. Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges. This affects:
xen, dm, md, ntfs3
the vermillion frame buffer
hibernation
ksm
swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.

[mgorman: Minor adjustments to rework on top of a series]
Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
Link: https://lkml.kernel.org/r/20230113111217.14134-7-mgorman@techsingularity.net
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1ebbb218 13-Jan-2023 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves

GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a vague
description. In preparation for the removal of GFP_ATOMIC redefine
__GFP_ATOMIC to simply mean non-blocking and renaming ALLOC_HARDER to
ALLOC_NON_BLOCK accordingly. __GFP_HIGH is required for access to
reserves but non-blocking is granted more access. For example, GFP_NOWAIT
is non-blocking but has no special access to reserves. A __GFP_NOFAIL
blocking allocation is granted access similar to __GFP_HIGH if the only
alternative is an OOM kill.

Link: https://lkml.kernel.org/r/20230113111217.14134-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ab350885 13-Jan-2023 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: explicitly define what alloc flags deplete min reserves

As there are more ALLOC_ flags that affect reserves, define what flags
affect reserves and clarify the effect of each flag.

Link: https://lkml.kernel.org/r/20230113111217.14134-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# eb2e2b42 13-Jan-2023 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags

A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
is accurate, it changes later in the series. In preparation, explicitly
record high-order atomic allocations in gfp_to_alloc_flags().

Link: https://lkml.kernel.org/r/20230113111217.14134-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c988dcbe 13-Jan-2023 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: treat RT tasks similar to __GFP_HIGH

RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
unusual. While there is some justification for allowing RT tasks access
to memory reserves, there is a strong chance that a RT task that is also
under memory pressure is at risk of missing deadlines anyway. Relax how
much reserves an RT task can access by treating it the same as __GFP_HIGH
allocations.

Note that in a future kernel release that the RT special casing will be
removed. Hard realtime tasks should be locking down resources in advance
and ensuring enough memory is available. Even a soft-realtime task like
audio or video live decoding which cannot jitter should be allocating both
memory and any disk space required up-front before the recording starts
instead of relying on reserves. At best, reserve access will only delay
the problem by a very short interval.

Link: https://lkml.kernel.org/r/20230113111217.14134-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 524c4807 13-Jan-2023 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE

Patch series "Discard __GFP_ATOMIC", v3.

Neil's patch has been residing in mm-unstable as commit 2fafb4fe8f7a ("mm:
discard __GFP_ATOMIC") for a long time and recently brought up again.
Most recently, I was worried that __GFP_HIGH allocations could use
high-order atomic reserves which is unintentional but there was no
response so lets revisit -- this series reworks how min reserves are used,
protects highorder reserves and then finishes with Neil's patch with very
minor modifications so it fits on top.

There was a review discussion on renaming __GFP_DIRECT_RECLAIM to
__GFP_ALLOW_BLOCKING but I didn't think it was that big an issue and is
orthogonal to the removal of __GFP_ATOMIC.

There were some concerns about how the gfp flags affect the min reserves
but it never reached a solid conclusion so I made my own attempt.

The series tries to iron out some of the details on how reserves are used.
ALLOC_HIGH becomes ALLOC_MIN_RESERVE and ALLOC_HARDER becomes
ALLOC_NON_BLOCK and documents how the reserves are affected. For example,
ALLOC_NON_BLOCK (no direct reclaim) on its own allows 25% of the min
reserve. ALLOC_MIN_RESERVE (__GFP_HIGH) allows 50% and both combined
allows deeper access again. ALLOC_OOM allows access to 75%.

High-order atomic allocations are explicitly handled with the caveat that
no __GFP_ATOMIC flag means that any high-order allocation that specifies
GFP_HIGH and cannot enter direct reclaim will be treated as if it was
GFP_ATOMIC.


This patch (of 6):

__GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint what it
means. As ALLOC_HIGH is internal to the allocator, rename it to
ALLOC_MIN_RESERVE to document that the min reserves can be depleted.

Link: https://lkml.kernel.org/r/20230113111217.14134-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20230113111217.14134-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 96f97c43 11-Jan-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm: mlock: update the interface to use folios

Update the mlock interface to accept folios rather than pages, bringing
the interface in line with the internal implementation.

munlock_vma_page() still requires a page_folio() conversion, however this
is consistent with the existent mlock_vma_page() implementation and a
product of rmap still dealing in pages rather than folios.

Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a60d5942 11-Jan-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert destroy_large_folio() to use folio_dtor

Replace a use of compound_dtor.

Link: https://lkml.kernel.org/r/20230111142915.1001531-21-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 65a689f3 11-Jan-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

page_alloc: use folio fields directly

Rmove the uses of compound_mapcount_ptr(), head_compound_mapcount() and
subpages_mapcount_ptr()

Link: https://lkml.kernel.org/r/20230111142915.1001531-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 94688e8e 11-Jan-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove folio_pincount_ptr() and head_compound_pincount()

We can use folio->_pincount directly, since all users are guarded by tests
of compound/large.

Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fc574488 04-Jan-2023 Mike Rapoport (IBM) <rppt@kernel.org>

mm/page_alloc: invert logic for early page initialisation checks

Rename early_page_uninitialised() to early_page_initialised() and invert
its logic to make the code more readable.

Link: https://lkml.kernel.org/r/20230104191805.2535864-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e4dde56c 21-Dec-2022 Yu Zhao <yuzhao@google.com>

mm: multi-gen LRU: per-node lru_gen_folio lists

For each node, memcgs are divided into two generations: the old and
the young. For each generation, memcgs are randomly sharded into
multiple bins to improve scalability. For each bin, an RCU hlist_nulls
is virtually divided into three segments: the head, the tail and the
default.

An onlining memcg is added to the tail of a random bin in the old
generation. The eviction starts at the head of a random bin in the old
generation. The per-node memcg generation counter, whose reminder (mod
2) indexes the old generation, is incremented when all its bins become
empty.

There are four operations:
1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
its current generation (old or young) and updates its "seg" to
"head";
2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
its current generation (old or young) and updates its "seg" to
"tail";
3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
the old generation, updates its "gen" to "old" and resets its "seg"
to "default";
4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
in the young generation, updates its "gen" to "young" and resets
its "seg" to "default".

The events that trigger the above operations are:
1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
2. The first attempt to reclaim an memcg below low, which triggers
MEMCG_LRU_TAIL;
3. The first attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_TAIL;
4. The second attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_YOUNG;
5. Attempting to reclaim an memcg below min, which triggers
MEMCG_LRU_YOUNG;
6. Finishing the aging on the eviction path, which triggers
MEMCG_LRU_YOUNG;
7. Offlining an memcg, which triggers MEMCG_LRU_OLD.

Note that memcg LRU only applies to global reclaim, and the
round-robin incrementing of their max_seq counters ensures the
eventual fairness to all eligible memcgs. For memcg reclaim, it still
relies on mem_cgroup_iter().

Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 44383cef 19-Dec-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan: allow sampling page_alloc allocations for HW_TAGS

As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial. As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.

Add two new boot parameters that allow to alleviate that slowdown:

- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
every Nth page_alloc allocation with the order configured by the second
added parameter (default: tag every such allocation).

- kasan.page_alloc.sample.order, which makes sampling enabled by the first
parameter only affect page_alloc allocations with the order equal or
greater than the specified value (default: 3, see below).

The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.

The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is
done for two reasons:

1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
costly to service", which corresponds to the idea that only large and
thus costly allocations are supposed to sampled.

2. One of the workloads targeted by this patch is a benchmark that sends
a large amount of data over a local loopback connection. Most multi-page
data allocations in the networking subsystem have the order of
SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).

When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sampling to a value
higher than 1 allows to lower the slowdown. The performance improvement
saturates around the sampling interval value of 10 with the default
sampling page order of 3. This lowers the slowdown to ~20%. The slowdown
in real scenarios involving the network will likely be better.

Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged. This lowers the value
of KASAN as a security mitigation.

However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations. The
rest ~93% of allocations are still checked deterministically.

Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Brand <markbrand@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 462a8e08 09-Feb-2023 David Chen <david.chen@nutanix.com>

Fix page corruption caused by racy check in __free_pages

When we upgraded our kernel, we started seeing some page corruption like
the following consistently:

BUG: Bad page state in process ganesha.nfsd pfn:1304ca
page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
flags: 0x17ffffc0000000()
raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
Call Trace:
dump_stack+0x74/0x96
bad_page.cold+0x63/0x94
check_new_page_bad+0x6d/0x80
rmqueue+0x46e/0x970
get_page_from_freelist+0xcb/0x3f0
? _cond_resched+0x19/0x40
__alloc_pages_nodemask+0x164/0x300
alloc_pages_current+0x87/0xf0
skb_page_frag_refill+0x84/0x110
...

Sometimes, it would also show up as corruption in the free list pointer
and cause crashes.

After bisecting the issue, we found the issue started from commit
e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages"):

if (put_page_testzero(page))
free_the_page(page, order);
else if (!PageHead(page))
while (order-- > 0)
free_the_page(page + (1 << order), order);

So the problem is the check PageHead is racy because at this point we
already dropped our reference to the page. So even if we came in with
compound page, the page can already be freed and PageHead can return
false and we will end up freeing all the tail pages causing double free.

Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac4b2901 01-Dec-2022 Deyan Wang <wonder_rock@126.com>

mm/page_alloc: update comments in __free_pages_ok()

Add a comment to explain why we call get_pfnblock_migratetype() twice in
__free_pages_ok().

Link: https://lkml.kernel.org/r/20221201135045.31663-1-wonder_rock@126.com
Signed-off-by: Deyan Wang <wonder_rock@126.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a4bafffb 22-Nov-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: simplify locking during free_unref_page_list

While freeing a large list, the zone lock will be released and reacquired
to avoid long hold times since commit c24ad77d962c ("mm/page_alloc.c:
avoid excessive IRQ disabled times in free_unref_page_list()"). As
suggested by Vlastimil Babka, the lockrelease/reacquire logic can be
simplified by reusing the logic that acquires a different lock when
changing zones.

Link: https://lkml.kernel.org/r/20221122131229.5263-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 57490774 18-Nov-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: leave IRQs enabled for per-cpu page allocations

The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired
using pcp_spin_trylock_irqsave() even though IRQs are disabled and
re-entrancy is impossible.

Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the
common case at the cost of some IRQ allocations taking a slower path. If
the PCP lists need to be refilled, the zone lock still needs to disable
IRQs but that will only happen on PCP refill and drain. If an IRQ is
raised when a PCP allocation is in progress, the trylock will fail and
fallback to using the buddy lists directly. Note that this may not be a
universal win if an interrupt-intensive workload also allocates heavily
from interrupt context and contends heavily on the zone->lock as a result.

[mgorman@techsingularity.net: migratetype might be wrong if a PCP was locked]
Link: https://lkml.kernel.org/r/20221122131229.5263-2-mgorman@techsingularity.net
[yuzhao@google.com: reported lockdep issue on IO completion from softirq]
[hughd@google.com: fix list corruption, lock improvements, micro-optimsations]
Link: https://lkml.kernel.org/r/20221118101714.19590-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c3e58a70 18-Nov-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: always remove pages from temporary list

Patch series "Leave IRQs enabled for per-cpu page allocations", v3.


This patch (of 2):

free_unref_page_list() has neglected to remove pages properly from the
list of pages to free since forever. It works by coincidence because
list_add happened to do the right thing adding the pages to just the PCP
lists. However, a later patch added pages to either the PCP list or the
zone list but only properly deleted the page from the list in one path
leading to list corruption and a subsequent failure. As a preparation
patch, always delete the pages from one list properly before adding to
another. On its own, this fixes nothing although it adds a fractional
amount of overhead but is critical to the next patch.

Link: https://lkml.kernel.org/r/20221118101714.19590-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20221118101714.19590-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4b51634c 22-Nov-2022 Hugh Dickins <hughd@google.com>

mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped

Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
but if we slightly abuse subpages_mapcount by additionally demanding that
one bit be set there when the compound page is PMD-mapped, then a cascade
of two atomic ops is able to maintain the stats without bit_spin_lock.

This is harder to reason about than when bit_spin_locked, but I believe
safe; and no drift in stats detected when testing. When there are racing
removes and adds, of course the sequence of operations is less well-
defined; but each operation on subpages_mapcount is atomically good. What
might be disastrous, is if subpages_mapcount could ever fleetingly appear
negative: but the pte lock (or pmd lock) these rmap functions are called
under, ensures that a last remove cannot race ahead of a first add.

Continue to make an exception for hugetlb (PageHuge) pages, though that
exception can be easily removed by a further commit if necessary: leave
subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
carry on checking compound_mapcount too in folio_mapped(), page_mapped().

Evidence is that this way goes slightly faster than the previous
implementation in all cases (pmds after ptes now taking around 103ms); and
relieves us of worrying about contention on the bit_spin_lock.

Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# cb67f428 02-Nov-2022 Hugh Dickins <hughd@google.com>

mm,thp,rmap: simplify compound page mapcount handling

Compound page (folio) mapcount calculations have been different for anon
and file (or shmem) THPs, and involved the obscure PageDoubleMap flag.
And each huge mapping and unmapping of a file (or shmem) THP involved
atomically incrementing and decrementing the mapcount of every subpage of
that huge page, dirtying many struct page cachelines.

Add subpages_mapcount field to the struct folio and first tail page, so
that the total of subpage mapcounts is available in one place near the
head: then page_mapcount() and total_mapcount() and page_mapped(), and
their folio equivalents, are so quick that anon and file and hugetlb don't
need to be optimized differently. Delete the unloved PageDoubleMap.

page_add and page_remove rmap functions must now maintain the
subpages_mapcount as well as the subpage _mapcount, when dealing with pte
mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
NR_FILE_MAPPED statistics still needs reading through the subpages, using
nr_subpages_unmapped() - but only when first or last pmd mapping finds
subpages_mapcount raised (double-map case, not the common case).

But are those counts (used to decide when to split an anon THP, and in
vmscan's pagecache_reclaimable heuristic) correctly maintained? Not
quite: since page_remove_rmap() (and also split_huge_pmd()) is often
called without page lock, there can be races when a subpage pte mapcount
0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
previous implementation had prevented. The statistics might become
inaccurate, and even drift down until they underflow through 0. That is
not good enough, but is better dealt with in a followup patch.

Update a few comments on first and second tail page overlaid fields.
hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
subpages_mapcount and compound_pincount are already correctly at 0, so
delete its reinitialization of compound_pincount.

A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
18 seconds on small pages, and used to take 1 second on huge pages, but
now takes 119 milliseconds on huge pages. Mapping by pmds a second time
used to take 860ms and now takes 92ms; mapping by pmds after mapping by
ptes (when the scan is needed) used to take 870ms and now takes 495ms.
But there might be some benchmarks which would show a slowdown, because
tail struct pages now fall out of cache until final freeing checks them.

Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ea4452de 18-Nov-2022 Qi Zheng <zhengqi.arch@bytedance.com>

mm: fix unexpected changes to {failslab|fail_page_alloc}.attr

When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.

[akpm@linux-foundation.org: unexport should_fail_ex()]
Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5aae9265 22-Oct-2022 Hugh Dickins <hughd@google.com>

mm: prep_compound_tail() clear page->private

Although page allocation always clears page->private in the first page or
head page of an allocation, it has never made a point of clearing
page->private in the tails (though 0 is often what is already there).

But now commit 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t
during THP split") issues a warning when page_tail->private is found to be
non-0 (unless it's swapcache).

Change that warning to dump page_tail (which also dumps head), instead of
just the head: so far we have seen dead000000000122, dead000000000003,
dead000000000001 or 0000000000000002 in the raw output for tail private.

We could just delete the warning, but today's consensus appears to want
page->private to be 0, unless there's a good reason for it to be set: so
now clear it in prep_compound_tail() (more general than just for THP; but
not for high order allocation, which makes no pass down the tails).

Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
Fixes: 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# df48a5f7 31-May-2022 Liam R. Howlett <Liam.Howlett@Oracle.com>

mm/page_alloc: reduce potential fragmentation in make_alloc_exact()

Try to avoid using the left over split page on the next request for a page
by calling __free_pages_ok() with FPI_TO_TAIL. This increases the
potential of defragmenting memory when it's used for a short period of
time.

Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 15cd9004 06-Oct-2022 Yafang Shao <laoar.shao@gmail.com>

mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page

PGFREE and PGALLOC represent the number of freed and allocated pages. So
the page order must be considered.

Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ef233450 28-Sep-2022 Alistair Popple <apopple@nvidia.com>

mm: free device private pages have zero refcount

Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free pages
have a reference count of one.

This makes it difficult to tell if a page is free or not because both free
and in use pages will have a non-zero refcount. Instead we should return
pages to the drivers page allocator with a zero reference count. Kernel
code can then safely use kernel functions such as get_page_unless_zero().

Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a57ae9ef 17-Sep-2022 Ran Xiaokai <ran.xiaokai@zte.com.cn>

mm/page_alloc: update comments for rmqueue()

Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists"), the per-cpu page allocators (PCP) is not
only for order-0 pages. Update the comments.

Link: https://lkml.kernel.org/r/20220918025640.208586-1-ran.xiaokai@zte.com.cn
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c9b3637f 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: fix obsolete comment in deferred_pfn_valid()

There are no architectures that can have holes in the memory map within a
pageblock since commit 859a85ddf90e ("mm: remove pfn_valid_within() and
CONFIG_HOLES_IN_ZONE"). Update the corresponding comment.

Link: https://lkml.kernel.org/r/20220916072257.9639-17-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 896c4d52 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: use costly_order in WARN_ON_ONCE_GFP()

There's no need to check whether order > PAGE_ALLOC_COSTLY_ORDER again.
Minor readability improvement.

Link: https://lkml.kernel.org/r/20220916072257.9639-15-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dae37a5d 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: init local variable buddy_pfn

The local variable buddy_pfn could be passed to buddy_merge_likely()
without initialization if the passed in order is MAX_ORDER - 1. This
looks buggy but buddy_pfn won't be used in this case as there's a order >=
MAX_ORDER - 2 check. Init buddy_pfn to 0 anyway to avoid possible future
misuse.

Link: https://lkml.kernel.org/r/20220916072257.9639-14-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c940e020 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: use helper macro SZ_1{K,M}

Use helper macro SZ_1K and SZ_1M to do the size conversion. Minor
readability improvement.

Link: https://lkml.kernel.org/r/20220916072257.9639-13-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6dc2c87a 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: make boot_nodestats static

It's only used in mm/page_alloc.c now. Make it static.

Link: https://lkml.kernel.org/r/20220916072257.9639-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c0352904 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: use local variable zone_idx directly

Use local variable zone_idx directly since it holds the exact value of
zone_idx(). No functional change intended.

Link: https://lkml.kernel.org/r/20220916072257.9639-10-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b3618455 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: add missing is_migrate_isolate() check in set_page_guard()

In MIGRATE_ISOLATE case, zone freepage state shouldn't be modified as
caller will take care of it. Add missing is_migrate_isolate() here to
avoid possible unbalanced freepage state. This would happen if someone
isolates the block, and then we face an MCE failure/soft-offline on a page
within that block. __mod_zone_freepage_state() will be triggered via
below call trace which already had been triggered back when block was
isolated:

take_page_off_buddy
break_down_buddy_pages
set_page_guard

Link: https://lkml.kernel.org/r/20220916072257.9639-9-linmiaohe@huawei.com
Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 022e7fa0 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: fix freeing static percpu memory

The size of struct per_cpu_zonestat can be 0 on !SMP && !NUMA. In that
case, zone->per_cpu_zonestats will always equal to boot_zonestats. But in
zone_pcp_reset(), zone->per_cpu_zonestats is freed via free_percpu()
directly without checking against boot_zonestats first. boot_zonestats
will be released by free_percpu() unexpectedly.

Link: https://lkml.kernel.org/r/20220916072257.9639-7-linmiaohe@huawei.com
Fixes: 28f836b6777b ("mm/page_alloc: split per cpu page lists and zone stats")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5749fcc5 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: add __init annotations to init_mem_debugging_and_hardening()

It's only called by mm_init(). Add __init annotations to it.

Link: https://lkml.kernel.org/r/20220916072257.9639-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 709924bc 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: remove obsolete comment in zone_statistics()

Since commit 43c95bcc51e4 ("mm/page_alloc: reduce duration that IRQs are
disabled for VM counters"), zone_statistics() is not called with
interrupts disabled. Update the corresponding comment.

Link: https://lkml.kernel.org/r/20220916072257.9639-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 638a9ae9 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm: remove obsolete macro NR_PCP_ORDER_MASK and NR_PCP_ORDER_WIDTH

Since commit 8b10b465d0e1 ("mm/page_alloc: free pages in a single pass
during bulk free"), they're not used anymore. Remove them.

Link: https://lkml.kernel.org/r/20220916072257.9639-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b89f1735 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: make zone_pcp_update() static

Since commit b92ca18e8ca5 ("mm/page_alloc: disassociate the pcp->high from
pcp->batch"), zone_pcp_update() is only used in mm/page_alloc.c. Move
zone_pcp_update() up to avoid forward declaration and then make it static.
No functional change intended.

Link: https://lkml.kernel.org/r/20220916072257.9639-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ce96fa62 16-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc: ensure kswapd doesn't accidentally go to sleep

Patch series "A few cleanup patches for mm", v2.

This series contains a few cleanup patches to remove the obsolete comments
and functions, use helper macro to improve readability and so on. More
details can be found in the respective changelogs.


This patch (of 16):

If ALLOC_KSWAPD is set, wake_all_kswapds() will be called to ensure kswapd
doesn't accidentally go to sleep. But when reserve_flags is set,
alloc_flags will be overwritten and ALLOC_KSWAPD is thus lost. Preserve
the ALLOC_KSWAPD flag in alloc_flags to ensure kswapd won't go to sleep
accidentally.

Link: https://lkml.kernel.org/r/20220916072257.9639-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220916072257.9639-2-linmiaohe@huawei.com
Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 42eaa27d 15-Sep-2022 Alexander Potapenko <glider@google.com>

security: kmsan: fix interoperability with auto-initialization

Heap and stack initialization is great, but not when we are trying uses of
uninitialized memory. When the kernel is built with KMSAN, having kernel
memory initialization enabled may introduce false negatives.

We disable CONFIG_INIT_STACK_ALL_PATTERN and CONFIG_INIT_STACK_ALL_ZERO
under CONFIG_KMSAN, making it impossible to auto-initialize stack
variables in KMSAN builds. We also disable
CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON to
prevent accidental use of heap auto-initialization.

We however still let the users enable heap auto-initialization at
boot-time (by setting init_on_alloc=1 or init_on_free=1), in which case a
warning is printed.

Link: https://lkml.kernel.org/r/20220915150417.722975-31-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3c206509 15-Sep-2022 Alexander Potapenko <glider@google.com>

init: kmsan: call KMSAN initialization routines

kmsan_init_shadow() scans the mappings created at boot time and creates
metadata pages for those mappings.

When the memblock allocator returns pages to pagealloc, we reserve 2/3 of
those pages and use them as metadata for the remaining 1/3. Once KMSAN
starts, every page allocated by pagealloc has its associated shadow and
origin pages.

kmsan_initialize() initializes the bookkeeping for init_task and enables
KMSAN.

Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b073d7f8 15-Sep-2022 Alexander Potapenko <glider@google.com>

mm: kmsan: maintain KMSAN metadata for page operations

Insert KMSAN hooks that make the necessary bookkeeping changes:
- poison page shadow and origins in alloc_pages()/free_page();
- clear page shadow and origins in clear_page(), copy_user_highpage();
- copy page metadata in copy_highpage(), wp_page_copy();
- handle vmap()/vunmap()/iounmap();

Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d452289f 13-Sep-2022 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: document bulkfree_pcp_prepare() return value

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a8368cd8 13-Sep-2022 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: rename check_free_page() to free_page_is_bad()

The name "check_free_page()" provides no information regarding its return
value when the page is indeed found to be bad.

Renaming it to "free_page_is_bad()" makes it clear that a `true' return
value means the page was bad.

And make it return a bool, not an int.

[akpm@linux-foundation.org: don't use bool as int]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ee0913c4 07-Sep-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: add pageblock_aligned() macro

Add pageblock_aligned() and use it to simplify code.

Link: https://lkml.kernel.org/r/20220907060844.126891-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4f9bc69a 07-Sep-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: reuse pageblock_start/end_pfn() macro

Move pageblock_start_pfn/pageblock_end_pfn() into pageblock-flags.h, then
they could be used somewhere else, not only in compaction, also use
ALIGN_DOWN() instead of round_down() to be pair with ALIGN(), which should
be same for pageblock usage.

Link: https://lkml.kernel.org/r/20220907060844.126891-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 974f4367 23-Aug-2022 Michal Hocko <mhocko@suse.com>

mm: reduce noise in show_mem for lowmem allocations

While discussing early DMA pool pre-allocation failure with Christoph [1]
I have realized that the allocation failure warning is rather noisy for
constrained allocations like GFP_DMA{32}. Those zones are usually not
populated on all nodes very often as their memory ranges are constrained.

This is an attempt to reduce the ballast that doesn't provide any relevant
information for those allocation failures investigation. Please note that
I have only compile tested it (in my default config setup) and I am
throwing it mostly to see what people think about it.

[1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de

[mhocko@suse.com: update]
Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz
[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: fix it for mapletree]
[akpm@linux-foundation.org: update it for Michal's update]
[mhocko@suse.com: fix arch/powerpc/xmon/xmon.c]
Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz
[akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c]
Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 9a157dd8 31-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: remove BUG_ON() in __isolate_free_page()

Drop unneed comment and blank, adjust the variable, and the most important
is to delete BUG_ON(). The page passed is always buddy page into
__isolate_free_page() from compaction, page_isolation and page_reporting,
and the caller also check the return, BUG_ON() is a too drastic measure,
remove it.

Link: https://lkml.kernel.org/r/20220901015043.189276-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 663d0cfd 30-Aug-2022 zezuo <zuoze1@huawei.com>

mm/page_alloc.c: delete a redundant parameter of rmqueue_pcplist

The gfp_flags parameter is not used in rmqueue_pcplist, so directly delete
this parameter.

Link: https://lkml.kernel.org/r/20220831013404.3360714-1-zuoze1@huawei.com
Signed-off-by: zezuo <zuoze1@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b4a0215e 27-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: fix null-ptr-deref in kswapd_is_running()

kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
kswapd_is_running() in kcompactd(),

kswapd_run/stop() kcompactd()
kswapd_is_running()
pgdat->kswapd // error or nomal ptr
verify pgdat->kswapd
// load non-NULL
pgdat->kswapd
pgdat->kswapd = NULL
task_is_running(pgdat->kswapd)
// Null pointer derefence

KASAN reports the null-ptr-deref shown below,

vmscan: Failed to start kswapd on node 0
...
BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
Read of size 8 at addr 0000000000000024 by task kcompactd0/37

CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
dump_backtrace+0x0/0x394
show_stack+0x34/0x4c
dump_stack+0x158/0x1e4
__kasan_report+0x138/0x140
kasan_report+0x44/0xdc
__asan_load8+0x94/0xd0
kcompactd+0x440/0x504
kthread+0x1a4/0x1f0
ret_from_fork+0x10/0x18

At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected
by mem_hotplug_begin/done(), but without kcompactd(). There is no need to
involve memory hotplug lock in kcompactd(), so let's add a new mutex to
protect pgdat->kswapd accesses.

Also, because the kcompactd task will check the state of kswapd task, it's
better to call kcompactd_stop() before kswapd_stop() to reduce lock
conflicts.

[akpm@linux-foundation.org: add comments]
Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c4f20f14 25-Aug-2022 Li Zhe <lizhe.67@bytedance.com>

page_ext: introduce boot parameter 'early_page_ext'

In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid some panic problem. It seems that we cannot track early page
allocations in current kernel even if page structure has been initialized
early.

This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above. It can help us to catch early page allocations. This is
useful especially when we find that the free memory value is not the same
right after different kernel booting.

[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fb70c487 15-Aug-2022 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: kill find_min_pfn_with_active_regions()

find_min_pfn_with_active_regions() is only called from free_area_init().
Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
and kill find_min_pfn_with_active_regions().

Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e933dc4a 02-Aug-2022 Abel Wu <wuyun.abel@bytedance.com>

mm/page_alloc: only search higher order when fallback

It seems unnecessary to search pages with order < alloc_order in
fallback allocation.

This can currently happen with ALLOC_NOFRAGMENT and alloc_order >
pageblock_order, so add a test to prevent it.

[vbabka@suse.cz: changelog addition]
Link: https://lkml.kernel.org/r/20220803025121.47018-1-wuyun.abel@bytedance.com
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 97bab178 03-Aug-2022 Li kunyu <kunyu@nfschina.com>

page_alloc: remove inactive initialization

The allocation address of the table pointer variable is first performed in
the function, no initialization assignment is required, and no invalid
pointer will appear.

Link: https://lkml.kernel.org/r/20220803064118.3664-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ebc97a52 22-Aug-2022 Yosry Ahmed <yosryahmed@google.com>

mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.

We keep track of several kernel memory stats (total kernel memory, page
tables, stack, vmalloc, etc) on multiple levels (global, per-node,
per-memcg, etc). These stats give insights to users to how much memory
is used by the kernel and for what purposes.

Currently, memory used by KVM mmu is not accounted in any of those
kernel memory stats. This patch series accounts the memory pages
used by KVM for page tables in those stats in a new
NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account
for other types of secondary pages tables (e.g. iommu page tables).

KVM has a decent number of large allocations that aren't for page
tables, but for most of them, the number/size of those allocations
scales linearly with either the number of vCPUs or the amount of memory
assigned to the VM. KVM's secondary page table allocations do not scale
linearly, especially when nested virtualization is in use.

From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's
per-VM pages_{4k,2m,1g} stats unless the guest is doing something
bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is
forced to allocate a large number of page tables even though the guest
isn't accessing that much memory). However, someone would need to either
understand how KVM works to make that connection, or know (or be told) to
go look at KVM's stats if they're running VMs to better decipher the stats.

Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE
is informative. For example, when backing a VM with THP vs. HugeTLB,
NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order
of magnitude higher with THP. So having this stat will at the very least
prove to be useful for understanding tradeoffs between VM backing types,
and likely even steer folks towards potential optimizations.

The original discussion with more details about the rationale:
https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org

This stat will be used by subsequent patches to count KVM mmu
memory usage.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# dac22531 15-Jul-2022 Maurizio Lombardi <mlombard@redhat.com>

mm: prevent page_frag_alloc() from corrupting the memory

A number of drivers call page_frag_alloc() with a fragment's size >
PAGE_SIZE.

In low memory conditions, __page_frag_cache_refill() may fail the order
3 cache allocation and fall back to order 0; In this case, the cache
will be smaller than the fragment, causing memory corruptions.

Prevent this from happening by checking if the newly allocated cache is
large enough for the fragment; if not, the allocation will fail and
page_frag_alloc() will return NULL.

Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Cc: Chen Lin <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3d36424b 23-Aug-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: fix race condition between build_all_zonelists and page allocation

Patrick Daly reported the following problem;

NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
[0] - ZONE_MOVABLE
[1] - ZONE_NORMAL
[2] - NULL

For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
memory_offline operation removes the last page from ZONE_MOVABLE,
build_all_zonelists() & build_zonerefs_node() will update
node_zonelists as shown below. Only populated zones are added.

NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
[0] - ZONE_NORMAL
[1] - NULL
[2] - NULL

The race is simple -- page allocation could be in progress when a memory
hot-remove operation triggers a zonelist rebuild that removes zones. The
allocation request will still have a valid ac->preferred_zoneref that is
now pointing to NULL and triggers an OOM kill.

This problem probably always existed but may be slightly easier to trigger
due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator") which distinguishes between
zones that are completely unpopulated versus zones that have valid pages
not managed by the buddy allocator (e.g. reserved, memblock, ballooning
etc). Memory hotplug had multiple stages with timing considerations
around managed/present page updates, the zonelist rebuild and the zone
span updates. As David Hildenbrand puts it

memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no
longer populated (present==0), we rebuild the zone lists.

Once that's done, we try shrinking the zone (start+spanned
pages) -- which results in zone_start_pfn == 0 if there are no
more pages. That happens *after* rebuilding the zonelists via
remove_pfn_range_from_zone().

The only requirement to fix the race is that a page allocation request
identifies when a zonelist rebuild has happened since the allocation
request started and no page has yet been allocated. Use a seqlock_t to
track zonelist updates with a lockless read-side of the zonelist and
protecting the rebuild and update of the counter with a spinlock.

[akpm@linux-foundation.org: make zonelist_update_seq static]
Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 189cdcfe 18-Jul-2022 Mark-PK Tsai <mark-pk.tsai@mediatek.com>

mm/page_alloc: correct the wrong cpuset file path in comment

cpuset.c was moved to kernel/cgroup/ in below commit
201af4c0fab0 ("cgroup: move cgroup files under kernel/cgroup/")
Correct the wrong path in comment.

Link: https://lkml.kernel.org/r/20220718120336.5145-1-mark-pk.tsai@mediatek.com
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 04ec0061 08-Jul-2022 Uros Bizjak <ubizjak@gmail.com>

mm/page_alloc: use try_cmpxchg in set_pfnblock_flags_mask

Use try_cmpxchg instead of cmpxchg in set_pfnblock_flags_mask. x86
CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after cmpxchg (and related move instruction in front of cmpxchg).
The main loop improves from:

1c5d: 48 89 c2 mov %rax,%rdx
1c60: 48 89 c1 mov %rax,%rcx
1c63: 48 21 fa and %rdi,%rdx
1c66: 4c 09 c2 or %r8,%rdx
1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi)
1c6e: 48 39 c1 cmp %rax,%rcx
1c71: 75 ea jne 1c5d <...>

to:

1c60: 48 89 ca mov %rcx,%rdx
1c63: 48 21 c2 and %rax,%rdx
1c66: 4c 09 c2 or %r8,%rdx
1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi)
1c6e: 75 f0 jne 1c60 <...>

Link: https://lkml.kernel.org/r/20220708140736.8737-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# dcadcf1c 05-Jul-2022 Gang Li <ligang.bdlg@bytedance.com>

mm, hugetlb: skip irrelevant nodes in show_free_areas()

show_free_areas() allows to filter out node specific data which is
irrelevant to the allocation request. But hugetlb_show_meminfo() still
shows hugetlb on all nodes, which is redundant and unnecessary.

Use show_mem_node_skip() to skip irrelevant nodes. And replace
hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid).

before-and-after sample output of OOM:

before:
```
[ 214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB
[ 214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
[ 214.388334] lowmem_reserve[]: 0 0 0 0 0
[ 214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20
[ 214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
```

after:
```
[ 145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u
[ 145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
[ 145.152315] lowmem_reserve[]: 0 0 0 0 0
[ 145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME)
[ 145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
```

Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 01b44456 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: replace local_lock with normal spinlock

struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure CPU
local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst)

local_lock protects against migration between when the percpu pointer is
accessed and the pcp->lock acquired. The lock acquisition is a preemption
point so in the worst case, a task could migrate to another NUMA node and
accidentally allocate remote memory. The main requirement is to pin the
task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.

Replace local_lock with helpers that pin a task to a CPU, lookup the
per-cpu structure and acquire the embedded lock. It's similar to
local_lock without breaking the intent behind the API. It is not a
complete API as only the parts needed for PCP-alloc are implemented but in
theory, the generic helpers could be promoted to a general API if there
was demand for an embedded lock within a per-cpu struct with a guarantee
that the per-cpu structure locked matches the running CPU and cannot use
get_cpu_var due to RT concerns. PCP requires these semantics to avoid
accidentally allocating remote memory.

[mgorman@techsingularity.net: use pcp_spin_trylock_irqsave instead of pcpu_spin_trylock_irqsave]
Link: https://lkml.kernel.org/r/20220627084645.GA27531@techsingularity.net
Link: https://lkml.kernel.org/r/20220624125423.6126-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 443c2acc 24-Jun-2022 Nicolas Saenz Julienne <nsaenzju@redhat.com>

mm/page_alloc: remotely drain per-cpu lists

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new
scheme is that drain operations are now migration safe.

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel to triggering the
__drain_all_pages(NULL, true) code path around ~100 times per second. The
new scheme performs a bit better (~5%), although the important point here
is there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.

Minchan Kim tested an earlier version and reported;

My workload is not NOHZ CPUs but run apps under heavy memory
pressure so they goes to direct reclaim and be stuck on
drain_all_pages until work on workqueue run.

unit: nanosecond
max(dur) avg(dur) count(dur)
166713013 487511.77786438033 1283

From traces, system encountered the drain_all_pages 1283 times and
worst case was 166ms and avg was 487us.

The other problem was alloc_contig_range in CMA. The PCP draining
takes several hundred millisecond sometimes though there is no
memory pressure or a few of pages to be migrated out but CPU were
fully booked.

Your patch perfectly removed those wasted time.

Link: https://lkml.kernel.org/r/20220624125423.6126-7-mgorman@techsingularity.net
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4b23a68f 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: protect PCP lists with a spinlock

Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy but this is inconvenient. Remote
draining of the lists is impossible and a workqueue is required and every
task allocation/free must disable then enable interrupts which is
expensive.

As preparation for dealing with both of those problems, protect the
lists with a spinlock. The IRQ-unsafe version of the lock is used
because IRQs are already disabled by local_lock_irqsave. spin_trylock
is used in combination with local_lock_irqsave() but later will be
replaced with a spin_trylock_irqsave when the local_lock is removed.

The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.

struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int batch; /* 12 4 */
short int free_factor; /* 16 2 */
short int expire; /* 18 2 */

/* XXX 4 bytes hole, try to pack */

struct list_head lists[13]; /* 24 208 */

/* size: 256, cachelines: 4, members: 7 */
/* sum members: 228, holes: 1, sum holes: 4 */
/* padding: 24 */
} __attribute__((__aligned__(64)));

There is overhead in the fast path due to acquiring the spinlock even
though the spinlock is per-cpu and uncontended in the common case. Page
Fault Test (PFT) running on a 1-socket reported the following results on a
1 socket machine.

5.19.0-rc3 5.19.0-rc3
vanilla mm-pcpspinirq-v5r16
Hmean faults/sec-1 869275.7381 ( 0.00%) 874597.5167 * 0.61%*
Hmean faults/sec-3 2370266.6681 ( 0.00%) 2379802.0362 * 0.40%*
Hmean faults/sec-5 2701099.7019 ( 0.00%) 2664889.7003 * -1.34%*
Hmean faults/sec-7 3517170.9157 ( 0.00%) 3491122.8242 * -0.74%*
Hmean faults/sec-8 3965729.6187 ( 0.00%) 3939727.0243 * -0.66%*

There is a small hit in the number of faults per second but given that the
results are more stable, it's borderline noise.

[akpm@linux-foundation.org: add missing local_unlock_irqrestore() on contention path]
Link: https://lkml.kernel.org/r/20220624125423.6126-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# e2a66c21 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: remove mistaken page == NULL check in rmqueue

If a page allocation fails, the ZONE_BOOSTER_WATERMARK should be tested,
cleared and kswapd woken whether the allocation attempt was via the PCP or
directly via the buddy list.

Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked
unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
mark the branch accordingly.

Link: https://lkml.kernel.org/r/20220624125423.6126-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 589d9973 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: split out buddy removal code from rmqueue into separate helper

This is a preparation page to allow the buddy removal code to be reused in
a later patch.

No functional change.

Link: https://lkml.kernel.org/r/20220624125423.6126-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5d0a661d 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: use only one PCP list for THP-sized allocations

The per_cpu_pages is cache-aligned on a standard x86-64 distribution
configuration but a later patch will add a new field which would push the
structure into the next cache line. Use only one list to store THP-sized
pages on the per-cpu list. This assumes that the vast majority of
THP-sized allocations are GFP_MOVABLE but even if it was another type, it
would not contribute to serious fragmentation that potentially causes a
later THP allocation failure. Align per_cpu_pages on the cacheline
boundary to ensure there is no false cache sharing.

After this patch, the structure sizing is;

struct per_cpu_pages {
int count; /* 0 4 */
int high; /* 4 4 */
int batch; /* 8 4 */
short int free_factor; /* 12 2 */
short int expire; /* 14 2 */
struct list_head lists[13]; /* 16 208 */

/* size: 256, cachelines: 4, members: 6 */
/* padding: 32 */
} __attribute__((__aligned__(64)));

Link: https://lkml.kernel.org/r/20220624125423.6126-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bf75f200 24-Jun-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: add page->buddy_list and page->pcp_list

Patch series "Drain remote per-cpu directly", v5.

Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference due to
per-cpu drain work queued by __drain_all_pages(). Introduce a new
mechanism to remotely drain the per-cpu lists. It is made possible by
remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has
two advantages, the time to drain is more predictable and other unrelated
tasks are not interrupted.

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority task
due to a workqueue item draining per-cpu page lists. While many workloads
can tolerate a brief interruption, it may cause a real-time task running
on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu
lists. The local_lock on its own prevents migration and the IRQ disabling
protects from corruption due to an interrupt arriving while a page
allocation is in progress.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq is
ultimately replaced by just the spinlock in the final patch. This allows
a remote CPU to safely. Follow-on work should allow the spin_lock_irqsave
to be converted to spin_lock to avoid IRQs being disabled/enabled in most
cases. The follow-on patch will be one kernel release later as it is
relatively high risk and it'll make bisections more clear if there are any
problems.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
this is not necessary but it avoids per_cpu_pages consuming another
cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a minor correction.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

Patch 7 uses a normal spinlock instead of local_lock for remote draining


This patch (of 7):

The page allocator uses page->lru for storing pages on either buddy or PCP
lists. Create page->buddy_list and page->pcp_list as a union with
page->lru. This is simply to clarify what type of list a page is on in
the page allocator.

No functional change intended.

[minchan@kernel.org: fix page lru fields in macros]
Link: https://lkml.kernel.org/r/20220624125423.6126-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ade63b41 22-Jun-2022 Yang Yang <yang.yang29@zte.com.cn>

mm/page_alloc: make the annotations of available memory more accurate

Not all systems use swap, so estimating available memory would help to
prevent swapping or OOM of system that not use swap.

And we need to reserve some page cache to prevent swapping or thrashing.
If somebody is accessing the pages in pagecache, and if too much would be
freed, most accesses might mean reading data from disk, i.e. thrashing.

Link: https://lkml.kernel.org/r/20220623020833.972979-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5375336c 17-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: convert destroy_compound_page() to destroy_large_folio()

All callers now have a folio, so push the folio->page conversion
down to this function.

[akpm@linux-foundation.org: uninline destroy_large_folio() to fix build issue]
Link: https://lkml.kernel.org/r/20220617175020.717127-20-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d9da8f6c 09-Jun-2022 Andrey Konovalov <andreyknvl@gmail.com>

mm: introduce clear_highpage_kasan_tagged

Add a clear_highpage_kasan_tagged() helper that does clear_highpage() on a
page potentially tagged by KASAN.

This helper is used by the following patch.

Link: https://lkml.kernel.org/r/4471979b46b2c487787ddcd08b9dc5fedd1b6ffd.1654798516.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# aeaec8e2 09-Jun-2022 Andrey Konovalov <andreyknvl@gmail.com>

mm: rename kernel_init_free_pages to kernel_init_pages

Rename kernel_init_free_pages() to kernel_init_pages(). This function is
not only used for free pages but also for pages that were just allocated.

Link: https://lkml.kernel.org/r/1ecaffc0a9c1404d4d7cf52efe0b2dc8a0c681d8.1654798516.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 446ec838 05-Jun-2022 Daniel Vetter <daniel.vetter@ffwll.ch>

mm/page_alloc: use might_alloc()

... instead of open coding it. Completely equivalent code, just a notch
more meaningful when reading.

Link: https://lkml.kernel.org/r/20220605152539.3196045-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 902c2d91 14-Jun-2022 Ma Wupeng <mawupeng1@huawei.com>

memblock: Disable mirror feature if kernelcore is not specified

If system have some mirrored memory and mirrored feature is not specified
in boot parameter, the basic mirrored feature will be enabled and this will
lead to the following situations:

- memblock memory allocation prefers mirrored region. This may have some
unexpected influence on numa affinity.

- contiguous memory will be split into several parts if parts of them
is mirrored memory via memblock_mark_mirror().

To fix this, variable mirrored_kernelcore will be checked in
memblock_mark_mirror(). Mark mirrored memory with flag MEMBLOCK_MIRROR iff
kernelcore=mirror is added in the kernel parameters.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>


# 6d05141a 10-Jun-2022 Catalin Marinas <catalin.marinas@arm.com>

mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON

Currently post_alloc_hook() skips the kasan unpoisoning if the tags will
be zeroed (__GFP_ZEROTAGS) or __GFP_SKIP_KASAN_UNPOISON is passed. Since
__GFP_ZEROTAGS is now accompanied by __GFP_SKIP_KASAN_UNPOISON, remove
the extra check.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20220610152141.2148929-4-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>


# 70c248ac 10-Jun-2022 Catalin Marinas <catalin.marinas@arm.com>

mm: kasan: Skip unpoisoning of user pages

Commit c275c5c6d50a ("kasan: disable freed user page poisoning with HW
tags") added __GFP_SKIP_KASAN_POISON to GFP_HIGHUSER_MOVABLE. A similar
argument can be made about unpoisoning, so also add
__GFP_SKIP_KASAN_UNPOISON to user pages. To ensure the user page is
still accessible via page_address() without a kasan fault, reset the
page->flags tag.

With the above changes, there is no need for the arm64
tag_clear_highpage() to reset the page->flags tag.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20220610152141.2148929-3-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>


# 9282012f 25-Jul-2022 Jaewon Kim <jaewon31.kim@samsung.com>

page_alloc: fix invalid watermark check on a negative value

There was a report that a task is waiting at the
throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was
increasing.

This is a bug where zone_watermark_fast returns true even when the free
is very low. The commit f27ce0e14088 ("page_alloc: consider highatomic
reserve in watermark fast") changed the watermark fast to consider
highatomic reserve. But it did not handle a negative value case which
can be happened when reserved_highatomic pageblock is bigger than the
actual free.

If watermark is considered as ok for the negative value, allocating
contexts for order-0 will consume all free pages without direct reclaim,
and finally free page may become depleted except highatomic free.

Then allocating contexts may fall into throttle_direct_reclaim. This
symptom may easily happen in a system where wmark min is low and other
reclaimers like kswapd does not make free pages quickly.

Handle the negative case by using MIN.

Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kerenl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1c563432 24-May-2022 Minchan Kim <minchan@kernel.org>

mm: fix is_pinnable_page against a cma page

Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so
the current is_pinnable_page() could miss CMA pages which have
MIGRATE_ISOLATE. It ends up pinning CMA pages as longterm for the
pin_user_pages() API so CMA allocations keep failing until the pin is
released.

CPU 0 CPU 1 - Task B

cma_alloc
alloc_contig_range
pin_user_pages_fast(FOLL_LONGTERM)
change pageblock as MIGRATE_ISOLATE
internal_get_user_pages_fast
lockless_pages_from_mm
gup_pte_range
try_grab_folio
is_pinnable_page
return true;
So, pinned the page successfully.
page migration failure with pinned page
..
.. After 30 sec
unpin_user_page(page)

CMA allocation succeeded after 30 sec.

The CMA allocation path protects the migration type change race using
zone->lock but what GUP path need to know is just whether the page is on
CMA area or not rather than exact migration type. Thus, we don't need
zone->lock but just checks migration type in either of (MIGRATE_ISOLATE
and MIGRATE_CMA).

Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause rejecting
of pinning pages on MIGRATE_ISOLATE pageblocks even though it's neither
CMA nor movable zone if the page is temporarily unmovable. However, such
a migration failure by unexpected temporal refcount holding is general
issue, not only come from MIGRATE_ISOLATE and the MIGRATE_ISOLATE is also
transient state like other temporal elevated refcount problem.

Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 86d28b07 26-May-2022 Zi Yan <ziy@nvidia.com>

mm: split free page with properly free memory accounting and without race

In isolate_single_pageblock(), free pages are checked without holding zone
lock, but they can go away in split_free_page() when zone lock is held.
Check the free page and its order again in split_free_page() when zone lock
is held. Recheck the page if the free page is gone under zone lock.

In addition, in split_free_page(), the free page was deleted from the page
list without changing free page accounting. Add the missing free page
accounting code.

Fix the type of order parameter in split_free_page().

Link: https://lore.kernel.org/lkml/20220525103621.987185e2ca0079f7b97b856d@linux-foundation.org/
Link: https://lkml.kernel.org/r/20220526231531.2404977-2-zi.yan@sent.com
Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Doug Berger <opendmb@gmail.com>
Link: https://lore.kernel.org/linux-mm/c3932a6f-77fe-29f7-0c29-fe6b1c67ab7b@gmail.com/
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michael Walle <michael@walle.cc>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c572e488 26-May-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: always attempt to allocate at least one page during bulk allocation

Peter Pavlisko reported the following problem on kernel bugzilla 216007.

When I try to extract an uncompressed tar archive (2.6 milion
files, 760.3 GiB in size) on newly created (empty) XFS file system,
after first low tens of gigabytes extracted the process hangs in
iowait indefinitely. One CPU core is 100% occupied with iowait,
the other CPU core is idle (on 2-core Intel Celeron G1610T).

It was bisected to c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for
buffers") but XFS is only the messenger. The problem is that nothing is
waking kswapd to reclaim some pages at a time the PCP lists cannot be
refilled until some reclaim happens. The bulk allocator checks that there
are some pages in the array and the original intent was that a bulk
allocator did not necessarily need all the requested pages and it was best
to return as quickly as possible.

This was fine for the first user of the API but both NFS and XFS require
the requested number of pages be available before making progress. Both
could be adjusted to call the page allocator directly if a bulk allocation
fails but it puts a burden on users of the API. Adjust the semantics to
attempt at least one allocation via __alloc_pages() before returning so
kswapd is woken if necessary.

It was reported via bugzilla that the patch addressed the problem and that
the tar extraction completed successfully. This may also address bug
215975 but has yet to be confirmed.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975
Link: https://lkml.kernel.org/r/20220526091210.GC3441@techsingularity.net
Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 88ee1343 24-May-2022 Zi Yan <ziy@nvidia.com>

mm: fix a potential infinite loop in start_isolate_page_range()

In isolate_single_pageblock() called by start_isolate_page_range(), there
are some pageblock isolation issues causing a potential infinite loop when
isolating a page range. This is reported by Qian Cai.

1. the pageblock was isolated by just changing pageblock migratetype
without checking unmovable pages. Calling set_migratetype_isolate() to
isolate pageblock properly.
2. an off-by-one error caused migrating pages unnecessarily, since the page
is not crossing pageblock boundary.
3. migrating a compound page across pageblock boundary then splitting the
free page later has a small race window that the free page might be
allocated again, so that the code will try again, causing an potential
infinite loop. Temporarily set the to-be-migrated page's pageblock to
MIGRATE_ISOLATE to prevent that and bail out early if no free page is
found after page migration.

An additional fix to split_free_page() aims to avoid crashing in
__free_one_page(). When the free page is split at the specified
split_pfn_offset, free_page_order should check both the first bit of
free_page_pfn and the last bit of split_pfn_offset and use the smaller
one. For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000,
free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then
0x8000, which the original algorithm did.

[akpm@linux-foundation.org: suppress min() warning]
Link: https://lkml.kernel.org/r/20220524194756.1698351-1-zi.yan@sent.com
Fixes: b2c9e2fbba3253 ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 3f913fc5 19-May-2022 Qi Zheng <zhengqi.arch@bytedance.com>

mm: fix missing handler for __GFP_NOWARN

We expect no warnings to be issued when we specify __GFP_NOWARN, but
currently in paths like alloc_pages() and kmalloc(), there are still some
warnings printed, fix it.

But for some warnings that report usage problems, we don't deal with them.
If such warnings are printed, then we should fix the usage problems.
Such as the following case:

WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

[zhengqi.arch@bytedance.com: v2]
Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 10e0f753 19-May-2022 Wonhyuk Yang <vvghjk1234@gmail.com>

mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()

Currently, trace point mm_page_alloc_zone_locked() doesn't show correct
information.

First, when alloc_flag has ALLOC_HARDER/ALLOC_CMA, page can be allocated
from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, tracepoint use
requested migration type not MIGRATE_HIGHATOMIC and MIGRATE_CMA.

Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order pages
to be stored on the per-cpu lists") percpu-list can store high order
pages. But trace point determine whether it is a refiil of percpu-list by
comparing requested order and 0.

To handle these problems, make mm_page_alloc_zone_locked() only be called
by __rmqueue_smallest with correct migration type. With a new argument
called percpu_refill, it can show roughly whether it is a refill of
percpu-list.

Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: <linuxgeek@linuxgeek.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c8bd84f7 12-May-2022 zhenwei pi <pizhenwei@bytedance.com>

mm/memory-failure.c: simplify num_poisoned_pages_dec

Don't decrease the number of poisoned pages in page_alloc.c, let the
memory-failure.c do inc/dec poisoned pages only.

Also simplify unpoison_memory(), only decrease the number of
poisoned pages when:
- TestClearPageHWPoison() succeed
- put_page_back_buddy succeed

After decreasing, print necessary log.

Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 11ac3e87 12-May-2022 Zi Yan <ziy@nvidia.com>

mm: cma: use pageblock_order as the single alignment

Now alloc_contig_range() works at pageblock granularity. Change CMA
allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
alignment.

Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6e263fff 12-May-2022 Zi Yan <ziy@nvidia.com>

mm: page_isolation: enable arbitrary range page isolation.

Now start_isolate_page_range() is ready to handle arbitrary range
isolation, so move the alignment check/adjustment into the function body.
Do the same for its counterpart undo_isolate_page_range().
alloc_contig_range(), its caller, can pass an arbitrary range instead of a
MAX_ORDER_NR_PAGES aligned one.

Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b2c9e2fb 12-May-2022 Zi Yan <ziy@nvidia.com>

mm: make alloc_contig_range work at pageblock granularity

alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
merging pageblocks with different migratetypes. It might unnecessarily
convert extra pageblocks at the beginning and at the end of the range.
Change alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range(). Because these=

Partially isolated pages causes free page accounting issues. The free
pages will be split and freed into separate migratetype lists; the in-use=

Pages will be migrated then the freed pages will be handled in the
aforementioned way.

[ziy@nvidia.com: fix deadlock/crash]
Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b48d8a8e 12-May-2022 Zi Yan <ziy@nvidia.com>

mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11.

This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
and alloc_contig_range(). It prepares for my upcoming changes to make
MAX_ORDER adjustable at boot time[1].

The MAX_ORDER - 1 alignment requirement comes from that
alloc_contig_range() isolates pageblocks to remove free memory from buddy
allocator but isolating only a subset of pageblocks within a page spanning
across multiple pageblocks causes free page accounting issues. Isolated
page might not be put into the right free list, since the code assumes the
migratetype of the first pageblock as the whole free page migratetype.
This is based on the discussion at [2].

To remove the requirement, this patchset:
1. isolates pages at pageblock granularity instead of
max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages);
2. splits free pages across the specified range or migrates in-use pages
across the specified range then splits the freed page to avoid free page
accounting issues (it happens when multiple pageblocks within a single page
have different migratetypes);
3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
range during isolation to avoid alloc_contig_range() failure when pageblocks
within a MAX_ORDER - 1 aligned range are allocated separately.
4. returns pages not in the range as it did before.

One optimization might come later:
1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
migratetypes when isolation fails in the middle of the range.

[1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/


This patch (of 6):

has_unmovable_pages() is only used in mm/page_isolation.c. Move it from
mm/page_alloc.c and make it static.

Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Minchan Kim <minchan@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8a87d695 12-May-2022 Wonhyuk Yang <vvghjk1234@gmail.com>

mm/page_alloc: cache the result of node_dirty_ok()

To spread dirty pages, nodes are checked whether they have reached the
dirty limit using the expensive node_dirty_ok(). To reduce the frequency
of calling node_dirty_ok(), the last node that hit the dirty limit can be
cached.

Instead of caching the node, caching both the node and its node_dirty_ok()
status can reduce the number of calle to node_dirty_ok().

[akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok]
Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.com
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Donghyeok Kim <dthex5d@gmail.com>
Cc: JaeSang Yoo <jsyoo5b@gmail.com>
Cc: Jiyoup Kim <lakroforce@gmail.com>
Cc: Ohhoon Kwon <ohkwon1043@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 014bb1de 09-May-2022 NeilBrown <neilb@suse.de>

mm: create new mm/swap.h header file

Patch series "MM changes to improve swap-over-NFS support".

Assorted improvements for swap-via-filesystem.

This is a resend of these patches, rebased on current HEAD. The only
substantial changes is that swap_dirty_folio has replaced
swap_set_page_dirty.

Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem. It
has previously worked for NFS but that broke a few releases back. This
series changes to use a new ->swap_rw rather than ->readpage and
->direct_IO. It also makes other improvements.

There is a companion series already in linux-next which fixes various
issues with NFS. Once both series land, a final patch is needed which
changes NFS over to use ->swap_rw.


This patch (of 10):

Many functions declared in include/linux/swap.h are only used within mm/

Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.

[akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d137a7cb 29-Apr-2022 Chen Wandun <chenwandun@huawei.com>

mm/page_alloc: simplify update of pgdat in wake_all_kswapds

There is no need to update last_pgdat for each zone, only update
last_pgdat when iterating the first zone of a node.

Link: https://lkml.kernel.org/r/20220322115635.2708989-1-chenwandun@huawei.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6fd3620b 29-Apr-2022 Joao Martins <joao.m.martins@oracle.com>

mm/page_alloc: reuse tail struct pages for compound devmaps

Currently memmap_init_zone_device() ends up initializing 32768 pages when
it only needs to initialize 128 given tail page reuse. That number is
worse with 1GB compound pages, 262144 instead of 128. Update
memmap_init_zone_device() to skip redundant initialization, detailed
below.

When a pgmap @vmemmap_shift is set, all pages are mapped at a given huge
page alignment and use compound pages to describe them as opposed to a
struct per 4K.

With @vmemmap_shift > 0 and when struct pages are stored in ram (!altmap)
most tail pages are reused. Consequently, the amount of unique struct
pages is a lot smaller than the total amount of struct pages being mapped.

The altmap path is left alone since it does not support memory savings
based on compound pages devmap.

Link: https://lkml.kernel.org/r/20220420155310.9712-6-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# aa282a15 29-Apr-2022 Ma Wupeng <mawupeng1@huawei.com>

mm/page_alloc.c: calc the right pfn if page size is not 4K

Previous 0x100000 is used to check the 4G limit in
find_zone_movable_pfns_for_nodes(). This is right in x86 because the page
size can only be 4K. But 16K and 64K are available in arm64. So replace
it with PHYS_PFN(SZ_4G).

Link: https://lkml.kernel.org/r/20220414101314.1250667-8-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bc53008e 29-Apr-2022 Wei Yang <richard.weiyang@gmail.com>

mm/vmscan: make sure wakeup_kswapd with managed zone

wakeup_kswapd() only wake up kswapd when the zone is managed.

For two callers of wakeup_kswapd(), they are node perspective.

* wake_all_kswapds
* numamigrate_isolate_page

If we picked up a !managed zone, this is not we expected.

This patch makes sure we pick up a managed zone for wakeup_kswapd(). And
it also use managed_zone in migrate_balanced_pgdat() to get the proper
zone.

[richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8170ac47 29-Apr-2022 Zi Yan <ziy@nvidia.com>

mm: wrap __find_buddy_pfn() with a necessary buddy page validation

Whenever the buddy of a page is found from __find_buddy_pfn(),
page_is_buddy() should be used to check its validity. Add a helper
function find_buddy_page_pfn() to find the buddy page and do the check
together.

[ziy@nvidia.com: updates per David]
Link: https://lkml.kernel.org/r/20220401230804.1658207-2-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/CAHk-=wji_AmYygZMTsPMdJ7XksMt7kOur8oDfDdniBRMjm4VkQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/7236E7CA-B5F1-4C04-AB85-E86FA3E9A54B@nvidia.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bb0e28eb 29-Apr-2022 Zi Yan <ziy@nvidia.com>

mm: page_alloc: simplify pageblock migratetype check in __free_one_page()

Move pageblock migratetype check code in the while loop to simplify the
logic. It also saves redundant buddy page checking code.

Link: https://lkml.kernel.org/r/20220401230804.1658207-1-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/27ff69f9-60c5-9e59-feb2-295250077551@suse.cz/
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 37931324 29-Apr-2022 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc: adding same penalty is enough to get round-robin order

To make node order in round-robin in the same distance group, we add a
penalty to the first node we got in each round.

To get a round-robin order in the same distance group, we don't need to
decrease the penalty since:

* find_next_best_node() always iterates node in the same order
* distance matters more then penalty in find_next_best_node()
* in nodes with the same distance, the first one would be picked up

So it is fine to increase same penalty when we get the first node in the
same distance group. Since we just increase a constance of 1 to node
penalty, it is not necessary to multiply MAX_NODE_LOAD for preference.

[richard.weiyang@gmail.com: remove remove MAX_NODE_LOAD, per Vlastimil]
Link: https://lkml.kernel.org/r/20220412001319.7462-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20220123013537.20491-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f2edd118 15-Apr-2022 Song Liu <song@kernel.org>

page_alloc: use vmalloc_huge for large system hash

Use vmalloc_huge() in alloc_large_system_hash() so that large system
hash (>= PMD_SIZE) could benefit from huge pages.

Note that vmalloc_huge only allocates huge pages for systems with
HAVE_ARCH_HUGE_VMALLOC.

Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e553f62f 14-Apr-2022 Juergen Gross <jgross@suse.com>

mm, page_alloc: fix build_zonerefs_node()

Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist. This is problematic when e.g.
all memory of a zone has been ballooned out when zonelists are being
rebuilt.

The decision whether to rebuild the zonelists when onlining new memory
is done based on populated_zone() returning 0 for the zone the memory
will be added to. The new zone is added to the zonelists only, if it
has free memory pages (managed_zone() returns a non-zero value) after
the memory has been onlined. This implies, that onlining memory will
always free the added pages to the allocator immediately, but this is
not true in all cases: when e.g. running as a Xen guest the onlined new
memory will be added only to the ballooned memory list, it will be freed
only when the guest is being ballooned up afterwards.

Another problem with using managed_zone() for the decision whether a
zone is being added to the zonelists is, that a zone with all memory
used will in fact be removed from all zonelists in case the zonelists
happen to be rebuilt.

Use populated_zone() when building a zonelist as it has been done before
that commit.

There was a report that QubesOS (based on Xen) is hitting this problem.
Xen has switched to use the zone device functionality in kernel 5.9 and
QubesOS wants to use memory hotplugging for guests in order to be able
to start a guest with minimal memory and expand it as needed. This was
the report leading to the patch.

Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 273ba85b 28-Mar-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Revert "mm/page_alloc: mark pagesets as __maybe_unused"

The local_lock() is now using a proper static inline function which is
enough for llvm to accept that the variable is used.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220328145810.86783-4-bigeasy@linutronix.de


# adb11e78 01-Apr-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm/munlock: protect the per-CPU pagevec by a local_lock_t

The access to mlock_pvec is protected by disabling preemption via
get_cpu_var() or implicit by having preemption disabled by the caller
(in mlock_page_drain() case). This breaks on PREEMPT_RT since
folio_lruvec_lock_irq() acquires a sleeping lock in this section.

Create struct mlock_pvec which consits of the local_lock_t and the
pagevec. Acquire the local_lock() before accessing the per-CPU pagevec.
Replace mlock_page_drain() with a _local() version which is invoked on
the local CPU and acquires the local_lock_t and a _remote() version
which uses the pagevec from a remote CPU which offline.

Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 787af64d 30-Mar-2022 Zi Yan <ziy@nvidia.com>

mm: page_alloc: validate buddy before check its migratetype.

Whenever a buddy page is found, page_is_buddy() should be called to
check its validity. Add the missing check during pageblock merge check.

Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable pageblocks with others")
Link: https://lore.kernel.org/all/20220330154208.71aca532@gandalf.local.home/
Reported-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9353ffa6 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: allow skipping memory init for HW_TAGS

Add a new GFP flag __GFP_SKIP_ZERO that allows to skip memory
initialization. The flag is only effective with HW_TAGS KASAN.

This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch. The reason to skip memory
initialization for these pages in page_alloc is because vmalloc code will
be initializing them instead.

With the current implementation, when __GFP_SKIP_ZERO is provided,
__GFP_ZEROTAGS is ignored. This doesn't matter, as these two flags are
never provided at the same time. However, if this is changed in the
future, this particular implementation detail can be changed as well.

Link: https://lkml.kernel.org/r/0d53efeff345de7d708e0baa0d8829167772521e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53ae233c 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: allow skipping unpoisoning for HW_TAGS

Add a new GFP flag __GFP_SKIP_KASAN_UNPOISON that allows skipping KASAN
poisoning for page_alloc allocations. The flag is only effective with
HW_TAGS KASAN.

This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch. The reason to skip KASAN
poisoning for these pages in page_alloc is because vmalloc code will be
poisoning them instead.

Also reword the comment for __GFP_SKIP_KASAN_POISON.

Link: https://lkml.kernel.org/r/35c97d77a704f6ff971dd3bfe4be95855744108e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9d0ca92 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: rework kasan_unpoison_pages call site

Rework the checks around kasan_unpoison_pages() call in post_alloc_hook().

The logical condition for calling this function is:

- If a software KASAN mode is enabled, we need to mark shadow memory.

- Otherwise, HW_TAGS KASAN is enabled, and it only makes sense to set
tags if they haven't already been cleared by tag_clear_highpage(),
which is indicated by init_tags.

This patch concludes the changes for post_alloc_hook().

Link: https://lkml.kernel.org/r/0ecebd0d7ccd79150e3620ea4185a32d3dfe912f.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e3cbba6 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook

Pull the kernel_init_free_pages() call in post_alloc_hook() out of the big
if clause for better code readability. This also allows for more
simplifications in the following patch.

This patch does no functional changes.

Link: https://lkml.kernel.org/r/a7a76456501eb37ddf9fca6529cee9555e59cdb1.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89b27116 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook

Pull the SetPageSkipKASanPoison() call in post_alloc_hook() out of the big
if clause for better code readability. This also allows for more
simplifications in the following patches.

Also turn the kasan_has_integrated_init() check into the proper
kasan_hw_tags_enabled() one. These checks evaluate to the same value, but
logically skipping kasan poisoning has nothing to do with integrated init.

Link: https://lkml.kernel.org/r/7214c1698b754ccfaa44a792113c95cc1f807c48.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9294b128 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook

Move tag_clear_highpage() loops out of the kasan_has_integrated_init()
clause as a code simplification.

This patch does no functional changes.

Link: https://lkml.kernel.org/r/587e3fc36358b88049320a89cc8dc6deaecb0cda.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b42090ae 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook

Currently, the code responsible for initializing and poisoning memory in
post_alloc_hook() is scattered across two locations: kasan_alloc_pages()
hook for HW_TAGS KASAN and post_alloc_hook() itself. This is confusing.

This and a few following patches combine the code from these two
locations. Along the way, these patches do a step-by-step restructure the
many performed checks to make them easier to follow.

Replace the only caller of kasan_alloc_pages() with its implementation.

As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
enabled, moving the code does no functional changes.

Also move init and init_tags variables definitions out of
kasan_has_integrated_init() clause in post_alloc_hook(), as they have the
same values regardless of what the if condition evaluates to.

This patch is not useful by itself but makes the simplifications in the
following patches easier to follow.

Link: https://lkml.kernel.org/r/5ac7e0b30f5cbb177ec363ddd7878a3141289592.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8491b90 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: refactor init checks in post_alloc_hook

Separate code for zeroing memory from the code clearing tags in
post_alloc_hook().

This patch is not useful by itself but makes the simplifications in the
following patches easier to follow.

This patch does no functional changes.

Link: https://lkml.kernel.org/r/2283fde963adfd8a2b29a92066f106cc16661a3c.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 487a32ec 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan: drop skip_kasan_poison variable in free_pages_prepare

skip_kasan_poison is only used in a single place. Call
should_skip_kasan_poison() directly for simplicity.

Link: https://lkml.kernel.org/r/1d33212e79bc9ef0b4d3863f903875823e89046f.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# db8a0477 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: init memory of skipped pages on free

Since commit 7a3b83537188 ("kasan: use separate (un)poison implementation
for integrated init"), when all init, kasan_has_integrated_init(), and
skip_kasan_poison are true, free_pages_prepare() doesn't initialize the
page. This is wrong.

Fix it by remembering whether kasan_poison_pages() performed
initialization, and call kernel_init_free_pages() if it didn't.

Reordering kasan_poison_pages() and kernel_init_free_pages() is OK, since
kernel_init_free_pages() can handle poisoned memory.

Link: https://lkml.kernel.org/r/1d97df75955e52727a3dc1c4e33b3b50506fc3fd.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3525330 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: simplify kasan_poison_pages call site

Simplify the code around calling kasan_poison_pages() in
free_pages_prepare().

This patch does no functional changes.

Link: https://lkml.kernel.org/r/ae4f9bcf071577258e786bcec4798c145d718c46.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c13c163 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: merge kasan_free_pages into free_pages_prepare

Currently, the code responsible for initializing and poisoning memory in
free_pages_prepare() is scattered across two locations: kasan_free_pages()
for HW_TAGS KASAN and free_pages_prepare() itself. This is confusing.

This and a few following patches combine the code from these two
locations. Along the way, these patches also simplify the performed
checks to make them easier to follow.

Replaces the only caller of kasan_free_pages() with its implementation.

As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
enabled, moving the code does no functional changes.

This patch is not useful by itself but makes the simplifications in the
following patches easier to follow.

Link: https://lkml.kernel.org/r/303498d15840bb71905852955c6e2390ecc87139.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b2c0713 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages

Currently, kernel_init_free_pages() serves two purposes: it either only
zeroes memory or zeroes both memory and memory tags via a different code
path. As this function has only two callers, each using only one code
path, this behaviour is confusing.

Pull the code that zeroes both memory and tags out of
kernel_init_free_pages().

As a result of this change, the code in free_pages_prepare() starts to
look complicated, but this is improved in the few following patches.
Those improvements are not integrated into this patch to make diffs easier
to read.

This patch does no functional changes.

Link: https://lkml.kernel.org/r/7719874e68b23902629c7cf19f966c4fd5f57979.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94ae8b83 24-Mar-2022 Andrey Konovalov <andreyknvl@gmail.com>

kasan, page_alloc: deduplicate should_skip_kasan_poison

Patch series "kasan, vmalloc, arm64: add vmalloc tagging support for SW/HW_TAGS", v6.

This patchset adds vmalloc tagging support for SW_TAGS and HW_TAGS
KASAN modes.

About half of patches are cleanups I went for along the way. None of them
seem to be important enough to go through stable, so I decided not to
split them out into separate patches/series.

The patchset is partially based on an early version of the HW_TAGS
patchset by Vincenzo that had vmalloc support. Thus, I added a
Co-developed-by tag into a few patches.

SW_TAGS vmalloc tagging support is straightforward. It reuses all of the
generic KASAN machinery, but uses shadow memory to store tags instead of
magic values. Naturally, vmalloc tagging requires adding a few
kasan_reset_tag() annotations to the vmalloc code.

HW_TAGS vmalloc tagging support stands out. HW_TAGS KASAN is based on Arm
MTE, which can only assigns tags to physical memory. As a result, HW_TAGS
KASAN only tags vmalloc() allocations, which are backed by page_alloc
memory. It ignores vmap() and others.

This patch (of 39):

Currently, should_skip_kasan_poison() has two definitions: one for when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, one for when it's not.

Instead of duplicating the checks, add a deferred_pages_enabled() helper
and use it in a single should_skip_kasan_poison() definition.

Also move should_skip_kasan_poison() closer to its caller and clarify all
conditions in the comment.

Link: https://lkml.kernel.org/r/cover.1643047180.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/658b79f5fb305edaf7dc16bc52ea870d3220d4a8.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c30daac 22-Mar-2022 Michal Hocko <mhocko@suse.com>

mm: make free_area_init_node aware of memory less nodes

free_area_init_node is also called from memory less node initialization
path (free_area_init_memoryless_node). It doesn't really make much sense
to display the physical memory range for those nodes: Initmem setup node
XX [mem 0x0000000000000000-0x0000000000000000]

Instead be explicit that the node is memoryless: Initmem setup node XX as
memoryless

Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 70b5b46a 22-Mar-2022 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: reorganize new pgdat initialization

When a !node_online node is brought up it needs a hotplug specific
initialization because the node could be either uninitialized yet or it
could have been recycled after previous hotremove. hotadd_init_pgdat is
responsible for that.

Internal pgdat state is initialized at two places currently
- hotadd_init_pgdat
- free_area_init_core_hotplug

There is no real clear cut what should go where but this patch's chosen to
move the whole internal state initialization into
free_area_init_core_hotplug. hotadd_init_pgdat is still responsible to
pull all the parts together - most notably to initialize zonelists because
those depend on the overall topology.

This patch doesn't introduce any functional change.

Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09f49dca 22-Mar-2022 Michal Hocko <mhocko@suse.com>

mm: handle uninitialized numa nodes gracefully

We have had several reports [1][2][3] that page allocator blows up when an
allocation from a possible node is requested. The underlying reason is
that NODE_DATA for the specific node is not allocated.

NUMA specific initialization is arch specific and it can vary a lot. E.g.
x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless for example.

One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
That is unfortunately adding a branch into allocator hot path and it
doesn't handle any other potential NODE_DATA users.

This patch takes a different approach (following a lead of [3]) and it pre
allocates pgdat for all possible nodes in an arch indipendent code -
free_area_init. All uninitialized nodes are treated as memoryless nodes.
node_state of the node is not changed because that would lead to other
side effects - e.g. sysfs representation of such a node and from past
discussions [4] it is known that some tools might have problems digesting
that.

Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).

generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.

[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com

[akpm@linux-foundation.org: replace comment, per Mike]

Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
Reported-by: Alexey Makhalov <amakhalov@vmware.com>
Tested-by: Alexey Makhalov <amakhalov@vmware.com>
Reported-by: Nico Pache <npache@redhat.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Tested-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c574bbe9 22-Mar-2022 Huang Ying <ying.huang@intel.com>

NUMA balancing: optimize page placement for memory tiering system

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are usually
different.

In such system, because of the memory accessing pattern changing etc,
some pages in the slow memory may become hot globally. So in this
patch, the NUMA balancing mechanism is enhanced to optimize the page
placement among the different memory types according to hot/cold
dynamically.

In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.

The original NUMA balancing mechanism will stop to migrate pages if the
free memory of the target node becomes below the high watermark. This
is a reasonable policy if there's only one memory type. But this makes
the original NUMA balancing mechanism almost do not work to optimize
page placement among different memory types. Details are as follows.

It's the common cases that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So, there are almost always
no enough free pages in the fast memory nodes, so that the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,

a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.

b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.

The choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is influenced, e.g. the direct reclaiming may be triggered.

The choice "b" works much better at this aspect. If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added. Which is larger than the
high watermark and can be controlled via watermark_scale_factor.

In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to be used to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follow, so that the users can enable/disable these
functionality individually.

The sysctl is converted from a Boolean value to a bits field. The
definition of the flags is,

- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING

We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results shows that the pmbench score can
improve up to 95.9%.

Thanks Andrew Morton to help fix the document format error.

Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a581865e 22-Mar-2022 Miaohe Lin <linmiaohe@huawei.com>

mm/hwpoison-inject: support injecting hwpoison to free page

memory_failure() can handle free buddy page. Support injecting hwpoison
to free page by adding is_free_buddy_page check when hwpoison filter is
disabled.

[akpm@linux-foundation.org: export is_free_buddy_page() to modules]

Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77fe7f13 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: check high-order pages for corruption during PCP operations

Eric Dumazet pointed out that commit 44042b449872 ("mm/page_alloc: allow
high-order pages to be stored on the per-cpu lists") only checks the
head page during PCP refill and allocation operations. This was an
oversight and all pages should be checked. This will incur a small
performance penalty but it's necessary for correctness.

Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3313204c 22-Mar-2022 Eric Dumazet <edumazet@google.com>

mm/page_alloc: call check_new_pages() while zone spinlock is not held

For high order pages not using pcp, rmqueue() is currently calling the
costly check_new_pages() while zone spinlock is held, and hard irqs
masked.

This is not needed, we can release the spinlock sooner to reduce zone
spinlock contention.

Note that after this patch, we call __mod_zone_freepage_state() before
deciding to leak the page because it is in bad state.

Link: https://lkml.kernel.org/r/20220304170215.1868106-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa7fc75f 22-Mar-2022 Suren Baghdasaryan <surenb@google.com>

mm: count time in drain_all_pages during direct reclaim as memory pressure

When page allocation in direct reclaim path fails, the system will make
one attempt to shrink per-cpu page lists and free pages from high alloc
reserves. Draining per-cpu pages into buddy allocator can be a very
slow operation because it's done using workqueues and the task in direct
reclaim waits for all of them to finish before proceeding. Currently
this time is not accounted as psi memory stall.

While testing mobile devices under extreme memory pressure, when
allocations are failing during direct reclaim, we notices that psi
events which would be expected in such conditions were not triggered.
After profiling these cases it was determined that the reason for
missing psi events was that a big chunk of time spent in direct reclaim
is not accounted as memory stall, therefore psi would not reach the
levels at which an event is generated. Further investigation revealed
that the bulk of that unaccounted time was spent inside drain_all_pages
call.

A typical captured case when drain_all_pages path gets activated:

__alloc_pages_slowpath took 44.644.613ns
__perform_reclaim took 751.668ns (1.7%)
drain_all_pages took 43.887.167ns (98.3%)

PSI in this case records the time spent in __perform_reclaim but ignores
drain_all_pages, IOW it misses 98.3% of the time spent in
__alloc_pages_slowpath.

Annotate __alloc_pages_direct_reclaim in its entirety so that delays
from handling page allocation failure in the direct reclaim path are
accounted as memory stall.

Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Tim Murray <timmurray@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ca75fa7 22-Mar-2022 Oscar Salvador <osalvador@suse.de>

arch/x86/mm/numa: Do not initialize nodes twice

On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA
nodes could be allocated at three different places.

- numa_register_memblks
- init_cpu_to_node
- init_gi_nodes

All these calls happen at setup_arch, and have the following order:

setup_arch
...
x86_numa_init
numa_init
numa_register_memblks
...
init_cpu_to_node
init_memory_less_node
alloc_node_data
free_area_init_memoryless_node
init_gi_nodes
init_memory_less_node
alloc_node_data
free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have
memory, so it skips over any memoryless node it founds. Later on, when
we have read ACPI's SRAT table, we call init_cpu_to_node() and
init_gi_nodes(), which initialize any memoryless node we might have that
have either CPU or Initiator affinity, meaning we allocate pg_data_t
struct for them and we mark them as ONLINE.

So far so good, but the thing is that after ("mm: handle uninitialized
numa nodes gracefully"), we allocate all possible NUMA nodes in
free_area_init(), meaning we have a picture like the following:

setup_arch
x86_numa_init
numa_init
numa_register_memblks <-- allocate non-memoryless node
x86_init.paging.pagetable_init
...
free_area_init
free_area_init_memoryless <-- allocate memoryless node
init_cpu_to_node
alloc_node_data <-- allocate memoryless node with CPU
free_area_init_memoryless_node
init_gi_nodes
alloc_node_data <-- allocate memoryless node with Initiator
free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
go ahead and allocate a new pg_data_t struct without checking anything,
meaning we end up allocating twice.

It should be mad clear that this only happens in the case where
memoryless NUMA node happens to have a CPU/Initiator affinity.

So get rid of init_memory_less_node() and just set the node online.

Note that setting the node online is needed, otherwise we choke down the
chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->... and we blow up in
bus_add_device(). As can be seen here:

BUG: kernel NULL pointer dereference, address: 0000000000000060
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
RIP: 0010:bus_add_device+0x5a/0x140
Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
Call Trace:
device_add+0x4c0/0x910
__register_one_node+0x97/0x2d0
__try_online_node+0x85/0xc0
try_online_node+0x25/0x40
cpu_up+0x4f/0x100
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x79
kernel_init_freeable+0x130/0x2f1
kernel_init+0x17/0x150
ret_from_fork+0x22/0x30

The reason is simple, by the time bringup_nonboot_cpus() gets called, we
did not register the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus()->p.

The following shows the order of the calls:

kernel_init_freeable
smp_init
bringup_nonboot_cpus
...
bus_add_device() <- we did not register node_subsys yet
do_basic_setup
do_initcalls
postcore_initcall(register_node_type);
register_node_type
subsys_system_register
subsys_register
bus_register <- register node_subsys bus

Why setting the node online saves us then? Well, simply because
__try_online_node() backs off when the node is online, meaning we do not
end up calling register_one_node() in the first place.

This is subtle, broken and deserves a deep analysis and thought about
how to put this into shape, but for now let us have this easy fix for
the leaking memory issue.

[osalvador@suse.de: add comments]
Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a791f44 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: do not prefetch buddies during bulk free

free_pcppages_bulk() has taken two passes through the pcp lists since
commit 0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when
picking pages to free") due to deferring the cost of selecting PCP lists
until the zone lock is held.

As the list processing now takes place under the zone lock, it's less
clear that this will always benefit for two reasons.

1. There is a guaranteed cost to calculating the buddy which definitely
has to be calculated again. However, as the zone lock is held and
there is no deferring of buddy merging, there is no guarantee that the
prefetch will have completed when the second buddy calculation takes
place and buddies are being merged. With or without the prefetch, there
may be further stalls depending on how many pages get merged. In other
words, a stall due to merging is inevitable and at best only one stall
might be avoided at the cost of calculating the buddy location twice.

2. As the zone lock is held, prefetch_nr makes less sense as once
prefetch_nr expires, the cache lines of interest have already been
merged.

The main concern is that there is a definite cost to calculating the
buddy location early for the prefetch and it is a "maybe win" depending
on whether the CPU prefetch logic and memory is fast enough. Remove the
prefetch logic on the basis that reduced instructions in a path is
always a saving where as the prefetch might save one memory stall
depending on the CPU and memory.

In most cases, this has marginal benefit as the calculations are a small
part of the overall freeing of pages. However, it was detectable on at
least one machine.

5.17.0-rc3 5.17.0-rc3
mm-highpcplimit-v2r1 mm-noprefetch-v1r1
Min elapsed 630.00 ( 0.00%) 610.00 ( 3.17%)
Amean elapsed 639.00 ( 0.00%) 623.00 * 2.50%*
Max elapsed 660.00 ( 0.00%) 660.00 ( 0.00%)

Link: https://lkml.kernel.org/r/20220221094119.15282-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f26b3fa0 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: limit number of high-order pages on PCP during bulk free

When a PCP is mostly used for frees then high-order pages can exist on
PCP lists for some time. This is problematic when the allocation
pattern is all allocations from one CPU and all frees from another
resulting in colder pages being used. When bulk freeing pages, limit
the number of high-order pages that are stored on the PCP lists.

Netperf running on localhost exhibits this pattern and while it does not
matter for some machines, it does matter for others with smaller caches
where cache misses cause problems due to reduced page reuse. Pages
freed directly to the buddy list may be reused quickly while still cache
hot where as storing on the PCP lists may be cold by the time
free_pcppages_bulk() is called.

Using perf kmem:mm_page_alloc, the 5 most used page frames were

5.17-rc3
13041 pfn=0x111a30
13081 pfn=0x5814d0
13097 pfn=0x108258
13121 pfn=0x689598
13128 pfn=0x5814d8

5.17-revert-highpcp
192009 pfn=0x54c140
195426 pfn=0x1081d0
200908 pfn=0x61c808
243515 pfn=0xa9dc20
402523 pfn=0x222bb8

5.17-full-series
142693 pfn=0x346208
162227 pfn=0x13bf08
166413 pfn=0x2711e0
166950 pfn=0x2702f8

The spread is wider as there is still time before pages freed to one PCP
get released with a tradeoff between fast reuse and reduced zone lock
acquisition.

On the machine used to gather the traces, the headline performance was
equivalent.

netperf-tcp
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1r1 mm-highpcplimit-v2
Hmean 64 839.93 ( 0.00%) 840.77 ( 0.10%) 841.02 ( 0.13%)
Hmean 128 1614.22 ( 0.00%) 1622.07 * 0.49%* 1636.41 * 1.37%*
Hmean 256 2952.00 ( 0.00%) 2953.19 ( 0.04%) 2977.76 * 0.87%*
Hmean 1024 10291.67 ( 0.00%) 10239.17 ( -0.51%) 10434.41 * 1.39%*
Hmean 2048 17335.08 ( 0.00%) 17399.97 ( 0.37%) 17134.81 * -1.16%*
Hmean 3312 22628.15 ( 0.00%) 22471.97 ( -0.69%) 22422.78 ( -0.91%)
Hmean 4096 25009.50 ( 0.00%) 24752.83 * -1.03%* 24740.41 ( -1.08%)
Hmean 8192 32745.01 ( 0.00%) 31682.63 * -3.24%* 32153.50 * -1.81%*
Hmean 16384 39759.59 ( 0.00%) 36805.78 * -7.43%* 38948.13 * -2.04%*

On a 1-socket skylake machine with a small CPU cache that suffers more if
cache misses are too high

netperf-tcp
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1 mm-highpcplimit-v2
Hmean 64 938.95 ( 0.00%) 941.50 * 0.27%* 943.61 * 0.50%*
Hmean 128 1843.10 ( 0.00%) 1857.58 * 0.79%* 1861.09 * 0.98%*
Hmean 256 3573.07 ( 0.00%) 3667.45 * 2.64%* 3674.91 * 2.85%*
Hmean 1024 13206.52 ( 0.00%) 13487.80 * 2.13%* 13393.21 * 1.41%*
Hmean 2048 22870.23 ( 0.00%) 23337.96 * 2.05%* 23188.41 * 1.39%*
Hmean 3312 31001.99 ( 0.00%) 32206.50 * 3.89%* 31863.62 * 2.78%*
Hmean 4096 35364.59 ( 0.00%) 36490.96 * 3.19%* 36112.54 * 2.11%*
Hmean 8192 48497.71 ( 0.00%) 49954.05 * 3.00%* 49588.26 * 2.25%*
Hmean 16384 58410.86 ( 0.00%) 60839.80 * 4.16%* 62282.96 * 6.63%*

Note that this was a machine that did not benefit from caching high-order
pages and performance is almost restored with the series applied. It's
not fully restored as cache misses are still higher. This is a trade-off
between optimising for a workload that does all allocs on one CPU and
frees on another or more general workloads that need high-order pages for
SLUB and benefit from avoiding zone->lock for every SLUB refill/drain.

Link: https://lkml.kernel.org/r/20220217002227.5739-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8b10b465 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: free pages in a single pass during bulk free

free_pcppages_bulk() has taken two passes through the pcp lists since
commit 0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when
picking pages to free") due to deferring the cost of selecting PCP lists
until the zone lock is held. Now that list selection is simplier, the
main cost during selection is bulkfree_pcp_prepare() which in the normal
case is a simple check and prefetching. As the list manipulations have
cost in itself, go back to freeing pages in a single pass.

The series up to this point was evaulated using a trunc microbenchmark
that is truncating sparse files stored in page cache (mmtests config
config-io-trunc). Sparse files were used to limit filesystem
interaction. The results versus a revert of storing high-order pages in
the PCP lists is

1-socket Skylake
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2
Min elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%)
Amean elapsed 543.00 ( 0.00%) 530.00 * 2.39%* 530.00 * 2.39%*
Stddev elapsed 4.83 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%)
CoeffVar elapsed 0.89 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%)
Max elapsed 550.00 ( 0.00%) 530.00 ( 3.64%) 530.00 ( 3.64%)
BAmean-50 elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%)
BAmean-95 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%)
BAmean-99 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%)

2-socket CascadeLake
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2
Min elapsed 510.00 ( 0.00%) 500.00 ( 1.96%) 500.00 ( 1.96%)
Amean elapsed 529.00 ( 0.00%) 521.00 ( 1.51%) 510.00 * 3.59%*
Stddev elapsed 16.63 ( 0.00%) 12.87 ( 22.64%) 11.55 ( 30.58%)
CoeffVar elapsed 3.14 ( 0.00%) 2.47 ( 21.46%) 2.26 ( 27.99%)
Max elapsed 550.00 ( 0.00%) 540.00 ( 1.82%) 530.00 ( 3.64%)
BAmean-50 elapsed 516.00 ( 0.00%) 512.00 ( 0.78%) 500.00 ( 3.10%)
BAmean-95 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%)
BAmean-99 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%)

The original motivation for multi-passes was will-it-scale page_fault1
using $nr_cpu processes.

2-socket CascadeLake (40 cores, 80 CPUs HT enabled)
5.17.0-rc3 5.17.0-rc3
vanilla mm-highpcpopt-v2
Hmean page_fault1-processes-2 2694662.26 ( 0.00%) 2695780.35 ( 0.04%)
Hmean page_fault1-processes-5 6425819.34 ( 0.00%) 6435544.57 * 0.15%*
Hmean page_fault1-processes-8 9642169.10 ( 0.00%) 9658962.39 ( 0.17%)
Hmean page_fault1-processes-12 12167502.10 ( 0.00%) 12190163.79 ( 0.19%)
Hmean page_fault1-processes-21 15636859.03 ( 0.00%) 15612447.26 ( -0.16%)
Hmean page_fault1-processes-30 25157348.61 ( 0.00%) 25169456.65 ( 0.05%)
Hmean page_fault1-processes-48 27694013.85 ( 0.00%) 27671111.46 ( -0.08%)
Hmean page_fault1-processes-79 25928742.64 ( 0.00%) 25934202.02 ( 0.02%) <--
Hmean page_fault1-processes-110 25730869.75 ( 0.00%) 25671880.65 * -0.23%*
Hmean page_fault1-processes-141 25626992.42 ( 0.00%) 25629551.61 ( 0.01%)
Hmean page_fault1-processes-172 25611651.35 ( 0.00%) 25614927.99 ( 0.01%)
Hmean page_fault1-processes-203 25577298.75 ( 0.00%) 25583445.59 ( 0.02%)
Hmean page_fault1-processes-234 25580686.07 ( 0.00%) 25608240.71 ( 0.11%)
Hmean page_fault1-processes-265 25570215.47 ( 0.00%) 25568647.58 ( -0.01%)
Hmean page_fault1-processes-296 25549488.62 ( 0.00%) 25543935.00 ( -0.02%)
Hmean page_fault1-processes-320 25555149.05 ( 0.00%) 25575696.74 ( 0.08%)

The differences are mostly within the noise and the difference close to
$nr_cpus is negligible.

Link: https://lkml.kernel.org/r/20220217002227.5739-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d61372bc 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: drain the requested list first during bulk free

Prior to the series, pindex 0 (order-0 MIGRATE_UNMOVABLE) was always
skipped first and the precise reason is forgotten. A potential reason
may have been to artificially preserve MIGRATE_UNMOVABLE but there is no
reason why that would be optimal as it depends on the workload. The
more likely reason is that it was less complicated to do a pre-increment
instead of a post-increment in terms of overall code flow. As
free_pcppages_bulk() now typically receives the pindex of the PCP list
that exceeded high, always start draining that list.

Link: https://lkml.kernel.org/r/20220217002227.5739-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd56eef2 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: simplify how many pages are selected per pcp list during bulk free

free_pcppages_bulk() selects pages to free by round-robining between
lists. Originally this was to evenly shrink pages by migratetype but
uneven freeing is inevitable due to high pages. Simplify list selection
by starting with a list that definitely has pages on it in
free_unref_page_commit() and for drain, it does not matter where
draining starts as all pages are removed.

Link: https://lkml.kernel.org/r/20220217002227.5739-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 35b6d770 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: track range of active PCP lists during bulk free

free_pcppages_bulk() frees pages in a round-robin fashion. Originally,
this was dealing only with migratetypes but storing high-order pages
means that there can be many more empty lists that are uselessly
checked. Track the minimum and maximum active pindex to reduce the
search space.

Link: https://lkml.kernel.org/r/20220217002227.5739-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca7b59b1 22-Mar-2022 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: fetch the correct pcp buddy during bulk free

Patch series "Follow-up on high-order PCP caching", v2.

Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists") was primarily aimed at reducing the cost of SLUB
cache refills of high-order pages in two ways. Firstly, zone lock
acquisitions was reduced and secondly, there were fewer buddy list
modifications. This is a follow-up series fixing some issues that
became apparant after merging.

Patch 1 is a functional fix. It's harmless but inefficient.

Patches 2-5 reduce the overhead of bulk freeing of PCP pages. While the
overhead is small, it's cumulative and noticable when truncating large
files. The changelog for patch 4 includes results of a microbench that
deletes large sparse files with data in page cache. Sparse files were
used to eliminate filesystem overhead.

Patch 6 addresses issues with high-order PCP pages being stored on PCP
lists for too long. Pages freed on a CPU potentially may not be quickly
reused and in some cases this can increase cache miss rates. Details
are included in the changelog.

This patch (of 6):

free_pcppages_bulk() prefetches buddies about to be freed but the order
must also be passed in as PCP lists store multiple orders.

Link: https://lkml.kernel.org/r/20220217002227.5739-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20220217002227.5739-2-mgorman@techsingularity.net
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddbc84f3 22-Mar-2022 Alistair Popple <apopple@nvidia.com>

mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node

ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
not enough room for ZONE_MOVABLE on that node.

Unfortunately this condition is not checked for. This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.

calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents
the maximum number of pages in a zone assuming no holes.

Subsequently it is possible free_area_init_core() will observe a zone of
size zero with present pages. In this case it will skip setting up the
zone, including the initialisation of free_lists[].

However populated_zone() checks zone->present_pages to see if a zone has
memory available. This is used by iterators such as
walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
free_list of each zone in each node, which are assumed to be initialised
due to the zone not being empty.

As free_area_init_core() never initialised the free_lists[] this results
in the following kernel crash when trying to read /proc/pagetypeinfo:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:pagetypeinfo_show+0x163/0x460
Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
FS: 00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
Call Trace:
seq_read_iter+0x128/0x460
proc_reg_read_iter+0x51/0x80
new_sync_read+0x113/0x1a0
vfs_read+0x136/0x1d0
ksys_read+0x70/0xf0
__x64_sys_read+0x1a/0x20
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix this by checking that the aligned zone_movable_pfn[] does not exceed
the end of the node, and if it does skip creating a movable zone on this
node.

Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4812d47 22-Mar-2022 Nathan Chancellor <nathan@kernel.org>

mm/page_alloc: mark pagesets as __maybe_unused

Commit 9983a9d577db ("locking/local_lock: Make the empty local_lock_*()
function a macro.") in the -tip tree converted the local_lock_*()
functions into macros, which causes a warning with clang with
CONFIG_PREEMPT_RT=n + CONFIG_DEBUG_LOCK_ALLOC=n:

mm/page_alloc.c:131:40: error: variable 'pagesets' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
^
1 error generated.

Prior to that change, clang was not able to tell that pagesets was
unused in this configuration because it does not perform cross function
analysis in the frontend. After that change, it sees that the macros
just do a typecheck on the lock member of pagesets, which is evaluated
at compile time (so the variable is technically "used"), meaning the
variable is not needed in the final assembly, as the warning states.

Mark the variable as __maybe_unused to make it clear to clang that this
is expected in this configuration so there is no more warning.

Link: https://github.com/ClangBuiltLinux/linux/issues/1593
Link: https://lkml.kernel.org/r/20220215184322.440969-1-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b3d40a2b 22-Mar-2022 David Hildenbrand <david@redhat.com>

mm: enforce pageblock_order < MAX_ORDER

Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
pages via __free_pages_core(), which cannot possibly work.

2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
pageblock_order, we could have a single pageblock partially managed by
two zones.

3) compaction code runs into __fragmentation_index() with order
>= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]

4) mm/page_reporting.c won't be reporting any pages with default
page_reporting_order == pageblock_order, as we'll be skipping the
reporting loop inside page_reporting_process_zone().

5) __rmqueue_fallback() will never be able to steal with
ALLOC_NOFRAGMENT.

pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allcoation of
gigantic pages, a little more reliable to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected
setups should be using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

[1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com

Link: https://lkml.kernel.org/r/20220214174132.219303-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56651377 22-Mar-2022 Nicolas Saenz Julienne <nsaenzju@redhat.com>

mm/page_alloc: don't pass pfn to free_unref_page_commit()

free_unref_page_commit() doesn't make use of its pfn argument, so get
rid of it.

Link: https://lkml.kernel.org/r/20220202140451.415928-1-nsaenzju@redhat.com
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1dd214b8 22-Mar-2022 Zi Yan <ziy@nvidia.com>

mm: page_alloc: avoid merging non-fallbackable pageblocks with others

This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
It prepares for the upcoming removal of the MAX_ORDER-1 alignment
requirement for CMA and alloc_contig_range().

MIGRATE_HIGHATOMIC should not merge with other migratetypes like
MIGRATE_ISOLATE and MIGRARTE_CMA[1], so this commit prevents that too.

Remove MIGRATE_CMA and MIGRATE_ISOLATE from fallbacks list, since they
are never used.

[1] https://lore.kernel.org/linux-mm/20211130100853.GP3366@techsingularity.net/

Link: https://lkml.kernel.org/r/20220124175957.1261961-1-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5232c63f 06-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Make compound_pincount always available

Move compound_pincount from the third page to the second page, which
means it's available for all compound pages. That lets us delete
hpage_pincount_available().

On 32-bit systems, there isn't enough space for both compound_pincount
and compound_nr in the second page (it would collide with page->private,
which is in use for pages in the swap cache), so revert the optimisation
of storing both compound_order and compound_nr on 32-bit systems.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>


# 5bf18281 19-Jan-2022 wangyong <wang.yong12@zte.com.cn>

delayacct: track delays from memory compact

Delay accounting does not track the delay of memory compact. When there
is not enough free memory, tasks can spend a amount of their time
waiting for compact.

To get the impact of tasks in direct memory compact, measure the delay
when allocating memory through memory compact.

Also update tools/accounting/getdelays.c:

/ # ./getdelays_next -di -p 304
print delayacct stats ON
printing IO accounting
PID 304

CPU count real total virtual total delay total delay average
277 780000000 849039485 18877296 0.068ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
5 11088812685 2217ms
THRASHING count delay total delay average
0 0 0ms
COMPACT count delay total delay average
3 72758 0ms
watch: read=0, write=0, cancelled_write=0

Link: https://lkml.kernel.org/r/1638619795-71451-1-git-send-email-wang.yong12@zte.com.cn
Signed-off-by: wangyong <wang.yong12@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Zhang Wenya <zhang.wenya1@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf181c58 14-Jan-2022 Naoya Horiguchi <naoya.horiguchi@nec.com>

mm/hwpoison: fix unpoison_memory()

After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation. Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages. So __get_hwpoison_page() highly likely returns -EBUSY
(meaning to fail to grab page refcount) and unpoison just clears
PG_hwpoison without releasing a refcount. That does not lead to a
critical issue like kernel panic, but unpoisoned pages never get back to
buddy (leaked permanently), which is not good.

To (partially) fix this, we need to identify "taken off" pages from
other types of hwpoisoned pages. We can't use refcount or page flags
for this purpose, so a pseudo flag is defined by hacking ->private
field. Someone might think that put_page() is enough to cancel
taken-off pages, but the normal free path contains some operations not
suitable for the current purpose, and can fire VM_BUG_ON().

Note that unpoison_memory() is now supposed to be cancel hwpoison events
injected only by madvise() or
/sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
injection, so please don't try to use unpoison when testing with MCE
injection.

[lkp@intel.com: report build failure for ARCH=i386]

Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c4dc63f0 14-Jan-2022 Baoquan He <bhe@redhat.com>

mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages

In kdump kernel of x86_64, page allocation failure is observed:

kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
Workqueue: events_unbound async_run_entry_fn
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5e
warn_alloc.cold+0x72/0xd6
__alloc_pages_slowpath.constprop.0+0xc69/0xcd0
__alloc_pages+0x1df/0x210
new_slab+0x389/0x4d0
___slab_alloc+0x58f/0x770
__slab_alloc.constprop.0+0x4a/0x80
kmem_cache_alloc_trace+0x24b/0x2c0
sr_probe+0x1db/0x620
......
device_add+0x405/0x920
......
__scsi_add_device+0xe5/0x100
ata_scsi_scan_host+0x97/0x1d0
async_run_entry_fn+0x30/0x130
process_one_work+0x1e8/0x3c0
worker_thread+0x50/0x3b0
? rescuer_thread+0x350/0x350
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Mem-Info:
......

The above failure happened when calling kmalloc() to allocate buffer with
GFP_DMA. It requests to allocate slab page from DMA zone while no managed
pages at all in there.

sr_probe()
--> get_capabilities()
--> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);

Because in the current kernel, dma-kmalloc will be created as long as
CONFIG_ZONE_DMA is enabled. However, kdump kernel of x86_64 doesn't have
managed pages on DMA zone since commit 6f599d84231f ("x86/kdump: Always
reserve the low 1M when the crashkernel option is specified"). The
failure can be always reproduced.

For now, let's mute the warning of allocation failure if requesting pages
from DMA zone while no managed pages.

[akpm@linux-foundation.org: fix warning]

Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62b31070 14-Jan-2022 Baoquan He <bhe@redhat.com>

mm_zone: add function to check if managed dma zone exists

Patch series "Handle warning of allocation failure on DMA zone w/o
managed pages", v4.

**Problem observed:
On x86_64, when crash is triggered and entering into kdump kernel, page
allocation failure can always be seen.

---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------

***Root cause:
In the current kernel, it assumes that DMA zone must have managed pages
and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not
always true. E.g in kdump kernel of x86_64, only low 1M is presented and
locked down at very early stage of boot, so that this low 1M won't be
added into buddy allocator to become managed pages of DMA zone. This
exception will always cause page allocation failure if page is requested
from DMA zone.

***Investigation:
This failure happens since below commit merged into linus's tree.
1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
7c321eb2b843 x86/kdump: Remove the backup region handling
6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified

Before them, on x86_64, the low 640K area will be reused by kdump kernel.
So in kdump kernel, the content of low 640K area is copied into a backup
region for dumping before jumping into kdump. Then except of those firmware
reserved region in [0, 640K], the left area will be added into buddy
allocator to become available managed pages of DMA zone.

However, after above commits applied, in kdump kernel of x86_64, the low
1M is reserved by memblock, but not released to buddy allocator. So any
later page allocation requested from DMA zone will fail.

At the beginning, if crashkernel is reserved, the low 1M need be locked
down because AMD SME encrypts memory making the old backup region
mechanims impossible when switching into kdump kernel.

Later, it was also observed that there are BIOSes corrupting memory
under 1M. To solve this, in commit f1d4d47c5851, the entire region of
low 1M is always reserved after the real mode trampoline is allocated.

Besides, recently, Intel engineer mentioned their TDX (Trusted domain
extensions) which is under development in kernel also needs to lock down
the low 1M. So we can't simply revert above commits to fix the page allocation
failure from DMA zone as someone suggested.

***Solution:
Currently, only DMA atomic pool and dma-kmalloc will initialize and
request page allocation with GFP_DMA during bootup.

So only initializ DMA atomic pool when DMA zone has available managed
pages, otherwise just skip the initialization.

For dma-kmalloc(), for the time being, let's mute the warning of
allocation failure if requesting pages from DMA zone while no manged
pages. Meanwhile, change code to use dma_alloc_xx/dma_map_xx API to
replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
not necessary. Christoph is posting patches to fix those under
drivers/scsi/. Finally, we can remove the need of dma-kmalloc() as people
suggested.

This patch (of 3):

In some places of the current kernel, it assumes that dma zone must have
managed pages if CONFIG_ZONE_DMA is enabled. While this is not always
true. E.g in kdump kernel of x86_64, only low 1M is presented and locked
down at very early stage of boot, so that there's no managed pages at all
in DMA zone. This exception will always cause page allocation failure if
page is requested from DMA zone.

Here add function has_managed_dma() and the relevant helper functions to
check if there's DMA zone with managed pages. It will be used in later
patches.

Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eaab8e75 14-Jan-2022 Anshuman Khandual <anshuman.khandual@arm.com>

mm/page_alloc.c: modify the comment section for alloc_contig_pages()

Clarify that the alloc_contig_pages() allocated range will always be
aligned to the requested nr_pages.

Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca831f29 14-Jan-2022 Xiongwei Song <sxwjean@gmail.com>

mm: page_alloc: fix building error on -Werror=array-compare

Arthur Marsh reported we would hit the error below when building kernel
with gcc-12:

CC mm/page_alloc.o
mm/page_alloc.c: In function `mem_init_print_info':
mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
8173 | if (start <= pos && pos < end && size > adj) \
|

In C++20, the comparision between arrays should be warned.

Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df4e817b 14-Jan-2022 Pasha Tatashin <pasha.tatashin@soleen.com>

mm: page table check

Check user page table entries at the time they are added and removed.

Allows to synchronously catch memory corruption issues related to double
mapping.

When a pte for an anonymous page is added into page table, we verify
that this pte does not already point to a file backed page, and vice
versa if this is a file backed page that is being added we verify that
this page does not have an anonymous mapping

We also enforce that read-only sharing for anonymous pages is allowed
(i.e. cow after fork). All other sharing must be for file pages.

Page table check allows to protect and debug cases where "struct page"
metadata became corrupted for some reason. For example, when refcnt or
mapcount become invalid.

Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c4386bd8 14-Jan-2022 Joao Martins <joao.m.martins@oracle.com>

mm/memremap: add ZONE_DEVICE support for compound pages

Add a new @vmemmap_shift property for struct dev_pagemap which specifies
that a devmap is composed of a set of compound pages of order
@vmemmap_shift, instead of base pages. When a compound page devmap is
requested, all but the first page are initialised as tail pages instead
of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.

Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because the
setting of page ref count to zero was redundant. devmap pages don't
come from page allocator though and only head page refcount is used for
compound pages, hence initialize tail page count to zero.

Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46487e00 14-Jan-2022 Joao Martins <joao.m.martins@oracle.com>

mm/page_alloc: refactor memmap_init_zone_device() page init

Move struct page init to an helper function __init_zone_device_page().

This is in preparation for sharing the storage for compound page
metadata.

Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b24eeef 14-Jan-2022 Joao Martins <joao.m.martins@oracle.com>

mm/page_alloc: split prep_compound_page into head and tail subparts

Patch series "mm, device-dax: Introduce compound pages in devmap", v7.

This series converts device-dax to use compound pages, and moves away
from the 'struct page per basepage on PMD/PUD' that is done today.

Doing so
1) unlocks a few noticeable improvements on unpin_user_pages() and
makes device-dax+altmap case 4x times faster in pinning (numbers
below and in last patch)
2) as mentioned in various other threads it's one important step
towards cleaning up ZONE_DEVICE refcounting.

I've split the compound pages on devmap part from the rest based on
recent discussions on devmap pending and future work planned[5][6].
There is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
specialization of the dax parts. I will pursue the rest of the work in
parallel once this part is merged, particular the GUP-{slow,fast}
improvements [7] and the tail struct page deduplication memory savings
part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.

Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initialize its metadata as compound pages, and
it introduces a new devmap property known as vmemmap_shift which
outlines how the vmemmap is structured (defaults to base pages as done
today). The property describe the page order of the metadata
essentially. While at it do a few cleanups in device-dax in patches
5-9. Finally enable device-dax usage of devmap @vmemmap_shift to a
value based on its own @align property. @vmemmap_shift returns 0 by
default (which is today's case of base pages in devmap, like fsdax or
the others) and the usage of compound devmap is optional. Starting with
device-dax (*not* fsdax) we enable it by default. There are a few
pinning improvements particular on the unpinning case and altmap, as
well as unpin_user_page_range_dirty_lock() being just as effective as
THP/hugetlb[0] pages.

$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
and without altmap), alongside gup_test selftests with dynamic dax
regions and static dax regions. Coupled with ndctl unit tests for
dynamic dax devices that exercise all of this. Note, for dynamic dax
regions I had to revert commit 8aa83e6395 ("x86/setup: Call
early_reserve_memory() earlier"), it is a known issue that this commit
broke efi_fake_mem=.

This patch (of 11):

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for compound page
metadata.

Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 132b0d21 05-Nov-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: remove the throttling logic from the page allocator

The page allocator stalls based on the number of pages that are waiting
for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the LRU tail are
unqueued dirty pages so the flusher should be active. If it fails to
make progress due to pages under writeback not being completed quickly
then it should stall on VMSCAN_THROTTLE_WRITEBACK.

Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8cd7c588 05-Nov-2021 Mel Gorman <mgorman@techsingularity.net>

mm/vmscan: throttle reclaim until some writeback completes if congested

Patch series "Remove dependency on congestion_wait in mm/", v5.

This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).

This series replaces the "congestion" throttling with 3 different types.

- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned

- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU

- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.

- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time

- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.

- Y file readers which is fio randomly reading small files

- Z anon memory hogs which continually map (100-dirty_ratio)% of memory

- Total estimated WSS = (100+dirty_ration) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.

First the results of the "anon latency" latency

stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.

It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory

Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)

The overall system CPU usage and elapsed time is as follows

5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstats show

5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97

Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.

The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.

Ops Percentage direct scans 90.59 77.37

For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling

Ops Page writes by reclaim 2613590.00 1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon 2932752.00 1917048.00

And there is less swapping.

Ops Page reclaim immediate 996248528.00 107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned 164284.00 153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were

6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all. A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map

1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.

- Direct reclaim will stall if all dirty pages are backed by congested
inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd3400ea 05-Nov-2021 Liangcai Fan <liangcaifan19@gmail.com>

mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged

When initializing transparent huge pages, min_free_kbytes would be
calculated according to what khugepaged expected.

So when transparent huge pages get disabled, min_free_kbytes should be
recalculated instead of the higher value set by khugepaged.

Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59d336bd 05-Nov-2021 Wang ShaoBo <bobo.shaobowang@huawei.com>

mm/page_alloc: use clamp() to simplify code

This patch uses clamp() to simplify code in init_per_zone_wmark_min().

Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c25cbfc 05-Nov-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm: page_alloc: use migrate_disable() in drain_local_pages_wq()

drain_local_pages_wq() disables preemption to avoid CPU migration during
CPU hotplug and can't use cpus_read_lock().

Using migrate_disable() works here, too. The scheduler won't take the
CPU offline until the task left the migrate-disable section. The
problem with disabled preemption here is that drain_local_pages()
acquires locks which are turned into sleeping locks on PREEMPT_RT and
can't be acquired with disabled preemption.

Use migrate_disable() in drain_local_pages_wq().

Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6ea8b5b 05-Nov-2021 Liangcai Fan <liangcaifan19@gmail.com>

mm/page_alloc.c: show watermark_boost of zone in zoneinfo

min/low/high_wmark_pages(z) is defined as

(z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost)

If kswapd is frequently woken up due to the increase of
min/low/high_wmark_pages, printing watermark_boost can quickly locate
whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
min/low/high_wmark_pages to increase.

Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8ca1b5a4 05-Nov-2021 Feng Tang <feng.tang@intel.com>

mm/page_alloc: detect allocation forbidden by cpuset and bail out early

There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.

It can be reproduced with command:

$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(where node 4 is a movable node)

runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
Call Trace:
dump_stack+0x6b/0x88
dump_header+0x4a/0x1e2
oom_kill_process.cold+0xb/0x10
out_of_memory.part.0+0xaf/0x230
out_of_memory+0x3d/0x80
__alloc_pages_slowpath.constprop.0+0x954/0xa20
__alloc_pages_nodemask+0x2d3/0x300
pipe_write+0x322/0x590
new_sync_write+0x196/0x1b0
vfs_write+0x1c3/0x1f0
ksys_write+0xa7/0xe0
do_syscall_64+0x52/0xd0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Mem-Info:
active_anon:392832 inactive_anon:182 isolated_anon:0
active_file:68130 inactive_file:151527 isolated_file:0
unevictable:2701 dirty:0 writeback:7
slab_reclaimable:51418 slab_unreclaimable:116300
mapped:45825 shmem:735 pagetables:2540 bounce:0
free:159849484 free_pcp:73 free_cma:0
Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that in this case, the target cpuset nodes only have
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal) like
GFP_HIGHUSER, and the cpuset limit forbids the allocation, then
out-of-memory killing is involved even when normal nodes and movable
nodes both have many free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the caller
to deal with it.

So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.

As page allocation is one of the hottest path in kernel, this check will
hurt all users with sane cpuset configuration, add a static branch check
and detect the abnormal config in cpuset memory binding setup so that
the extra check cost in page allocation is not paid by everyone.

[thanks to Micho Hocko and David Rientjes for suggesting not handling
it inside OOM code, adding cpuset check, refining comments]

Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8446b59b 05-Nov-2021 Eric Dumazet <edumazet@google.com>

mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()

Grabbing zone lock in is_free_buddy_page() gives a wrong sense of
safety, and has potential performance implications when zone is
experiencing lock contention.

In any case, if a caller needs a stable result, it should grab zone lock
before calling this function.

Link: https://lkml.kernel.org/r/20210922152833.4023972-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 61bb6cd2 05-Nov-2021 Geert Uytterhoeven <geert+renesas@glider.be>

mm: move node_reclaim_distance to fix NUMA without SMP

Patch series "Fix NUMA without SMP".

SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).

This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.

This patch (of 2):

If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):

sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'

Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.

Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes: a55c7454a8c887b2 ("sched/topology: Improve load balancing on AMD EPYC systems")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54d032ce 05-Nov-2021 Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>

mm/page_alloc: use accumulated load when building node fallback list

In build_zonelists(), when the fallback list is built for the nodes, the
node load gets reinitialized during each iteration. This results in
nodes with same distances occupying the same slot in different node
fallback lists rather than appearing in the intended round- robin
manner. This results in one node getting picked for allocation more
compared to other nodes with the same distance.

As an example, consider a 4 node system with the following distance
matrix.

Node 0 1 2 3
----------------
0 10 12 32 32
1 12 10 32 32
2 32 32 10 12
3 32 32 12 10

For this case, the node fallback list gets built like this:

Node Fallback list
---------------------
0 0 1 2 3
1 1 0 3 2
2 2 3 0 1
3 3 2 0 1 <-- Unexpected fallback order

In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
same order which results in more allocations getting satisfied from node
0 compared to node 1.

The effect of this on remote memory bandwidth as seen by stream
benchmark is shown below:

Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

----------------------------------------
BANDWIDTH (MB/s)
TEST Case 1 Case 2
----------------------------------------
COPY 57479.6 110791.8
SCALE 55372.9 105685.9
ADD 50460.6 96734.2
TRIADD 50397.6 97119.1
----------------------------------------

The bandwidth drop in Case 1 occurs because most of the allocations get
satisfied by node 0 as it appears first in the fallback order for both
nodes 2 and 3.

This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration. With this the
nodes with the same distance rightly get assigned in the round robin
manner.

In fact this was how it was originally until commit f0c0b2b808f2
("change zonelist order: zonelist order selection logic") dropped the
load accumulation and resorted to initializing the load during each
iteration.

While zonelist ordering was removed by commit c9bff3eebc09 ("mm,
page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
accumulation in build_zonelists() remained. So essentially this patch
reverts back to the accumulated node load logic.

After this fix, the fallback order gets built like this:

Node Fallback list
------------------
0 0 1 2 3
1 1 0 3 2
2 2 3 0 1
3 3 2 1 0 <-- Note the change here

The bandwidth in Case 1 improves and matches Case 2 as shown below.

----------------------------------------
BANDWIDTH (MB/s)
TEST Case 1 Case 2
----------------------------------------
COPY 110438.9 110107.2
SCALE 105930.5 105817.5
ADD 97005.1 96159.8
TRIADD 97441.5 96757.1
----------------------------------------

The correctness of the fallback list generation has been verified for
the above node configuration where the node 3 starts as memory-less node
and comes up online only during memory hotplug.

[bharata@amd.com: Added changelog, review, test validation]

Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection logic")
Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6cf25392 05-Nov-2021 Bharata B Rao <bharata@amd.com>

mm/page_alloc: print node fallback order

Patch series "Fix NUMA nodes fallback list ordering".

For a NUMA system that has multiple nodes at same distance from other
nodes, the fallback list generation prefers same node order for them
instead of round-robin thereby penalizing one node over others. This
series fixes it.

More description of the problem and the fix is present in the patch
description.

This patch (of 2):

Print information message about the allocation fallback order for each
NUMA node during boot.

No functional changes here. This makes it easier to illustrate the
problem in the node fallback list generation, which the next patch
fixes.

Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba7f1b9e 05-Nov-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid]

Don't use with __GFP_HIGHMEM because page_address() cannot represent
highmem pages without kmap(). Newly allocated pages would leak as
page_address() will return NULL for highmem pages here. But It works
now because the callers do not specify __GFP_HIGHMEM now.

Link: https://lkml.kernel.org/r/20210902121242.41607-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86fb05b9 05-Nov-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: use helper function zone_spans_pfn()

Use helper function zone_spans_pfn() to check whether pfn is within a
zone to simplify the code slightly.

Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cba630b 05-Nov-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()

The second two paragraphs about "all pages pinned" and pages_scanned is
obsolete. And There are PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP orders
in pcp. So the same order assumption is not held now.

Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff7ed9e4 05-Nov-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: simplify the code by using macro K()

Use helper macro K() to convert the pages to the corresponding size.
Minor readability improvement.

Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea808b4e 05-Nov-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()

Patch series "Cleanups and fixup for page_alloc", v2.

This series contains cleanups to remove meaningless VM_BUG_ON(), use
helpers to simplify the code and remove obsolete comment. Also we avoid
allocating highmem pages via alloc_pages_exact[_nid]. More details can be
found in the respective changelogs.

This patch (of 5):

It's meaningless to VM_BUG_ON() order != pageblock_order just after
setting order to pageblock_order. Remove it.

Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 084f7e23 05-Nov-2021 Eric Dumazet <edumazet@google.com>

mm/large system hash: avoid possible NULL deref in alloc_large_system_hash

If __vmalloc() returned NULL, is_vm_area_hugepages(NULL) will fault if
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y

Link: https://lkml.kernel.org/r/20210915212530.2321545-1-eric.dumazet@gmail.com
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc09cb13 15-Dec-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: Add folio allocation functions

The __folio_alloc(), __folio_alloc_node() and folio_alloc() functions
are mostly for type safety, but they also ensure that the page allocator
allocates a compound page and initialises the deferred list if the page
is large enough to have one.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# bbc6b703 01-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/memcg: Convert mem_cgroup_uncharge() to take a folio

Convert all the callers to call page_folio(). Most of them were already
using a head page, but a few of them I can't prove were, so this may
actually fix a bug.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>


# eac96c3e 28-Oct-2021 Yang Shi <shy828301@gmail.com>

mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

When handling shmem page fault the THP with corrupted subpage could be
PMD mapped if certain conditions are satisfied. But kernel is supposed
to send SIGBUS when trying to map hwpoisoned page.

There are two paths which may do PMD map: fault around and regular
fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") the thing was even worse in fault around path. The THP
could be PMD mapped as long as the VMA fits regardless what subpage is
accessed and corrupted. After this commit as long as head page is not
corrupted the THP could be PMD mapped.

In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.

So introduce a new page flag called HasHWPoisoned on the first tail
page. It indicates the THP has hwpoisoned subpage(s). It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.

The soft offline path doesn't need this since soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.
But shmem THP didn't get split then migrated at all.

Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8dcb3060 28-Oct-2021 Shakeel Butt <shakeelb@google.com>

memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT

Commit 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in
__vmalloc_area_node()") switched to bulk page allocator for order 0
allocation backing vmalloc. However bulk page allocator does not
support __GFP_ACCOUNT allocations and there are several users of
kvmalloc(__GFP_ACCOUNT).

For now make __GFP_ACCOUNT allocations bypass bulk page allocator. In
future if there is workload that can be significantly improved with the
bulk page allocator with __GFP_ACCCOUNT support, we can revisit the
decision.

Link: https://lkml.kernel.org/r/20211014151607.2171970-1-shakeelb@google.com
Fixes: 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Tested-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 053cfda1 08-Sep-2021 Miaohe Lin <linmiaohe@huawei.com>

mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype

If it's not prepared to free unref page, the pcp page migratetype is
unset. Thus we will get rubbish from get_pcppage_migratetype() and
might list_del(&page->lru) again after it's already deleted from the list
leading to grumble about data corruption.

Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b097002 07-Sep-2021 David Hildenbrand <david@redhat.com>

mm: track present early pages per zone

Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

I. Goal

The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:

1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).

2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.

Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].

3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.

As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.

As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.

4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is reqired for fast memory hotplug for drivers that add individual
memory blocks, like virito-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.

The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.

II. Approach

This series does 3 things:

1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.

2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.

3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.

I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).

For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.

III. Target Usage

The target usage will be:

1) Linux boots with "mhp_default_online_type=offline"

2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true

3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"

4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")

IV. Example

For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)

For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)

Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.

V. Doc Update

I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
usptream. Until then, details can be found in patch #2.

VI. Future Work

1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.

This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really rrequired.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.

This patch (of 9):

For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.

Pass a page instead of the zone to adjust_present_page_count(), similar as
adjust_managed_page_count() and derive the zone from the page.

It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.

Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 859a85dd 07-Sep-2021 Mike Rapoport <rppt@kernel.org>

mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE

Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock. This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.

The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than simple if.

This patch (of 2):

After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
pfn_valid_within().

With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.

Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ac95884 02-Sep-2021 Yang Shi <yang.shi@linux.alibaba.com>

mm/migrate: enable returning precise migrate_pages() success count

Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.

Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79c28a41 02-Sep-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/numa: automatically generate node migration order

Patch series "Migrate Pages in lieu of discard", v11.

We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties. First, the newer allocations can end up in
the slower persistent memory. Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.

This patchset implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.

While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.

This is not perfect. It "strands" pages in slower memory and never brings
them back to fast DRAM. Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sit in the DRAM node. So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.

We have tested the patchset with the postgresql and pgbench. On a
2-socket server machine with DRAM and PMEM, the kernel with the patchset
can improve the score of pgbench up to 22.1% compared with that of the
DRAM only + disk case. This comes from the reduced disk read throughput
(which reduces up to 70.8%).

== Open Issues ==

* Memory policies and cpusets that, for instance, restrict allocations
to DRAM can be demoted to PMEM whenever they opt in to this
new mechanism. A cgroup-level API to opt-in or opt-out of
these migrations will likely be required as a follow-on.
* Could be more aggressive about where anon LRU scanning occurs
since it no longer necessarily involves I/O. get_scan_count()
for instance says: "If we have no swap space, do not bother
scanning anon pages"

This patch (of 9):

Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table. This allows creating single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages. A node with no target is a "terminal
node", so reclaim acts normally there. The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.

When memory fills up on a node, memory contents can be automatically
migrated to another node. The biggest problems are knowing when to
migrate and to where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists. Those lists already tell us if
memory is full where to look next. It would also be logical to move
memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists. This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate
node migration ordering. But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

RCU is used to allow lock-less read of node_demotion[] and prevent
demotion cycles been observed. If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed. Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

terminal_node = start_node;
while (node_demotion[terminal_node] != NUMA_NO_NODE) {
terminal_node = node_demotion[terminal_node];
}

The initial values are:

node_demotion[0] = 1;
node_demotion[1] = NUMA_NO_NODE;

and are updated to:

node_demotion[0] = NUMA_NO_NODE;
node_demotion[1] = 0;

What guarantees that the cycle is not observed:

node_demotion[0] = 1;
node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.

RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[]. This ensures that any updates are *eventually*
observed.

Consider the above loop again. The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

This code is unused for now. It will be called later in the
series.

Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88dc6f20 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

mm/page_alloc.c: use in_task()

Obsoleted in_intrrupt() include task context with disabled BH, it's better
to use in_task() instead.

Link: https://lkml.kernel.org/r/877caa99-1994-5545-92d2-d0bb2e394182@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b446da6 02-Sep-2021 Mike Rapoport <rppt@kernel.org>

mm/page_alloc: make alloc_node_mem_map() __init rather than __ref

alloc_node_mem_map() is never only called from free_area_init_node() that
is an __init function.

Make the actual alloc_node_mem_map() also __init and its stub version
static inline.

Link: https://lkml.kernel.org/r/20210716064124.31865-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b346075f 02-Sep-2021 Nico Pache <npache@redhat.com>

mm/page_alloc.c: fix 'zone_id' may be used uninitialized in this function warning

When compiling with -Werror, cc1 will warn that 'zone_id' may be used
uninitialized in this function warning.

Initialize the zone_id as 0.

Its safe to assume that if the code reaches this point it has at least one
numa node with memory, so no need for an assertion before
init_unavilable_range.

Link: https://lkml.kernel.org/r/20210716210336.1114114-1-npache@redhat.com
Fixes: 122e093c1734 ("mm/page_alloc: fix memory map initialization for descending nodes")
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c803b3c8 02-Sep-2021 Mike Rapoport <rppt@kernel.org>

mm: introduce memmap_alloc() to unify memory map allocation

There are several places that allocate memory for the memory map:
alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
__populate_section_memmap() for SPARSEMEM.

The memory allocated in the FLATMEM case is zeroed and it is never
poisoned, regardless of CONFIG_PAGE_POISON setting.

The memory allocated in the SPARSEMEM cases is not zeroed and it is
implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.

Introduce memmap_alloc() wrapper for memblock allocators that will be used
for both FLATMEM and SPARSEMEM cases and will makei memory map zeroing and
poisoning consistent for different memory models.

Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3ab6baf 02-Sep-2021 Mike Rapoport <rppt@kernel.org>

mm/page_alloc: always initialize memory map for the holes

Patch series "mm: ensure consistency of memory map poisoning".

Currently memory map allocation for FLATMEM case does not poison the
struct pages regardless of CONFIG_PAGE_POISON setting.

This happens because allocation of the memory map for FLATMEM and SPARSMEM
use different memblock functions and those that are used for SPARSMEM case
(namely memblock_alloc_try_nid_raw() and memblock_alloc_exact_nid_raw())
implicitly poison the allocated memory.

Another side effect of this implicit poisoning is that early setup code
that uses the same functions to allocate memory burns cycles for the
memory poisoning even if it was not intended.

These patches introduce memmap_alloc() wrapper that ensure that the memory
map allocation is consistent for different memory models.

This patch (of 4):

Currently memory map for the holes is initialized only when SPARSEMEM
memory model is used. Yet, even with FLATMEM there could be holes in the
physical memory layout that have memory map entries.

For instance, the memory reserved using e820 API on i386 or
"reserved-memory" nodes in device tree would not appear in memblock.memory
and hence the struct pages for such holes will be skipped during memory
map initialization.

These struct pages will be zeroed because the memory map for FLATMEM
systems is allocated with memblock_alloc_node() that clears the allocated
memory. While zeroed struct pages do not cause immediate problems, the
correct behaviour is to initialize every page using __init_single_page().
Besides, enabling page poison for FLATMEM case will trigger
PF_POISONED_CHECK() unless the memory map is properly initialized.

Make sure init_unavailable_range() is called for both SPARSEMEM and
FLATMEM so that struct pages representing memory holes would appear as
PG_Reserved with any memory layout.

[rppt@kernel.org: fix microblaze]
Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org

Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb2169ce 02-Sep-2021 liuhailong <liuhailong@oppo.com>

mm: add kernel_misc_reclaimable in show_free_areas

Print NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users can
check whether the shrinker is working correctly and to show the current
memory usage.

Link: https://lkml.kernel.org/r/20210813104725.4562-1-liuhailong@oppo.com
Signed-off-by: liuhailong <liuhailong@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f3eaf45 02-Sep-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: report a more useful address for reclaim acquisition

A recent lockdep report included these lines:

[ 96.177910] 3 locks held by containerd/770:
[ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3},
at: do_user_addr_fault+0x115/0x770
[ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at:
get_swap_device+0x33/0x140
[ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at:
__fs_reclaim_acquire+0x5/0x30

While it was not useful to that bug report to know where the reclaim lock
had been acquired, it might be useful under other circumstances. Allow
the caller of __fs_reclaim_acquire to specify the instruction pointer to
use.

Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47aef601 19-Aug-2021 Doug Berger <opendmb@gmail.com>

mm/page_alloc: don't corrupt pcppage_migratetype

When placing pages on a pcp list, migratetype values over
MIGRATE_PCPTYPES get added to the MIGRATE_MOVABLE pcp list.

However, the actual migratetype is preserved in the page and should
not be changed to MIGRATE_MOVABLE or the page may end up on the wrong
free_list.

The impact is that HIGHATOMIC or CMA pages getting bulk freed from the
PCP lists could potentially end up on the wrong buddy list. There are
various consequences but minimally NR_FREE_CMA_PAGES accounting could
get screwed up.

[mgorman@techsingularity.net: changelog update]

Link: https://lkml.kernel.org/r/20210811182917.2607994-1-opendmb@gmail.com
Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69e5d322 23-Jul-2021 Sergei Trofimovich <slyfox@gentoo.org>

mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction

To reproduce the failure we need the following system:

- kernel command: page_poison=1 init_on_free=0 init_on_alloc=0

- kernel config:
* CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
* CONFIG_INIT_ON_FREE_DEFAULT_ON=y
* CONFIG_PAGE_POISONING=y

Resulting in:

0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G U O 5.13.1-gentoo-x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
Call Trace:
dump_stack+0x64/0x7c
__kernel_unpoison_pages.cold+0x48/0x84
post_alloc_hook+0x60/0xa0
get_page_from_freelist+0xdb8/0x1000
__alloc_pages+0x163/0x2b0
__get_free_pages+0xc/0x30
pgd_alloc+0x2e/0x1a0
mm_init+0x185/0x270
dup_mm+0x6b/0x4f0
copy_process+0x190d/0x1b10
kernel_clone+0xba/0x3b0
__do_sys_clone+0x8f/0xb0
do_syscall_64+0x68/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches")
init_on_alloc never enabled static branch by default. It could only be
enabed explicitly by init_mem_debugging_and_hardening().

But after commit 51cba1ebc60d, a static branch could already be enabled
by default. There was no code to ever disable it. That caused
page_poison=1 / init_on_free=1 conflict.

This change extends init_mem_debugging_and_hardening() to also disable
static branch disabling.

Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Reported-by: <bowsingbetee@pm.me>
Tested-by: <bowsingbetee@protonmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06147843 14-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

mm/page_alloc: further fix __alloc_pages_bulk() return value

The author of commit b3b64ebd3822 ("mm/page_alloc: do bulk array
bounds check after checking populated elements") was possibly
confused by the mixture of return values throughout the function.

The API contract is clear that the function "Returns the number of pages
on the list or array." It does not list zero as a unique return value with
a special meaning. Therefore zero is a plausible return value only if
@nr_pages is zero or less.

Clean up the return logic to make it clear that the returned value is
always the total number of pages in the array/list, not the number of
pages that were allocated during this call.

The only change in behavior with this patch is the value returned if
prepare_alloc_pages() fails. To match the API contract, the number of
pages currently in the array/list is returned in this case.

The call site in __page_pool_alloc_pages_slow() also seems to be confused
on this matter. It should be attended to by someone who is familiar with
that code.

[mel@techsingularity.net: Return nr_populated if 0 pages are requested]

Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Cc: Zhang Qiang <Qiang.Zhang@windriver.com>
Cc: Yanfei Xu <yanfei.xu@windriver.com>
Cc: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5c15cea 14-Jul-2021 Yanfei Xu <yanfei.xu@windriver.com>

mm/page_alloc: correct return value when failing at preparing

If the array passed in is already partially populated, we should return
"nr_populated" even failing at preparing arguments stage.

Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 187ad460 14-Jul-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: avoid page allocator recursion with pagesets.lock held

Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as
follows

__alloc_pages_bulk()
local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
prep_new_page():
post_alloc_hook():
set_page_owner():
__set_page_owner():
save_stack():
stack_depot_save():
alloc_pages():
alloc_page_interleave():
__alloc_pages():
get_page_from_freelist():
rm_queue():
rm_queue_pcplist():
local_lock_irqsave(&pagesets.lock, flags);
*** DEADLOCK ***

Zhang, Qiang also reported

BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
.....
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
__alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
save_stack+0x15e/0x1e0 mm/page_owner.c:120
__set_page_owner+0x50/0x290 mm/page_owner.c:181
prep_new_page mm/page_alloc.c:2445 [inline]
__alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
__vmalloc_area_node mm/vmalloc.c:2845 [inline]
__vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
__vmalloc_node mm/vmalloc.c:2996 [inline]
vzalloc+0x67/0x80 mm/vmalloc.c:3066

There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance. The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface. The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation of PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.

Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net
Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com>
Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54aa3866 14-Jul-2021 Matteo Croce <mcroce@microsoft.com>

Revert "mm/page_alloc: make should_fail_alloc_page() static"

This reverts commit f7173090033c70886d925995e9dfdfb76dbb2441.

Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y:

LD vmlinux
BTFIDS vmlinux
FAILED unresolved symbol should_fail_alloc_page
make: *** [Makefile:1199: vmlinux] Error 255
make: *** Deleting file 'vmlinux'

Link: https://lkml.kernel.org/r/20210708191128.153796-1-mcroce@linux.microsoft.com
Fixes: f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bce2443 05-Jul-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: Revert pahole zero-sized workaround

Commit dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to
local_lock") folded in a workaround patch for pahole that was unable to
deal with zero-sized percpu structures.

A superior workaround is achieved with commit a0b8200d06ad ("kbuild:
skip per-CPU BTF generation for pahole v1.18-v1.21").

This patch reverts the dummy field and the pahole version check.

Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7173090 30-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: make should_fail_alloc_page() static

make W=1 generates the following warning for mm/page_alloc.c

mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
^~~~~~~~~~~~~~~~~~~~~~

This function is deliberately split out for BPF to allow errors to be
injected. The function is not used anywhere else so it is local to the
file. Make it static which should still allow error injection to be used
similar to how block/blk-core.c:should_fail_bio() works.

Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 041711ce 30-Jun-2021 Zhen Lei <thunder.leizhen@huawei.com>

mm: fix spelling mistakes

Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness

Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7118fc29 30-Jun-2021 Mike Kravetz <mike.kravetz@oracle.com>

hugetlb: address ref count racing in prep_compound_gigantic_page

In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative. The root
cause of the possible race is prep_compound_gigantic_page uncondittionally
setting the ref count of pages to zero. It does this because
prep_compound_gigantic_page is handed a 'group' of pages from an allocator
and needs to convert that group of pages to a compound page. The ref
count of each page in this 'group' is one as set by the allocator.
However, the ref count of compound page tail pages must be zero.

The potential race comes about when ref counted pages are returned from
the allocator. When this happens, other mm code could also take a
reference on the page. __page_cache_add_speculative is one such example.
Therefore, prep_compound_gigantic_page can not just set the ref count of
pages to zero as it does today. Doing so would lose the reference taken
by any other code. This would lead to BUGs in code checking ref counts
and could possibly even lead to memory corruption.

There are two possible ways to address this issue.

1) Make all allocators of gigantic groups of pages be able to return a
properly constructed compound page.

2) Make prep_compound_gigantic_page be more careful when constructing a
compound page.

This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
if it is one. If the cmpxchg fails, call synchronize_rcu() in the hope
that the extra ref count will be driopped during a rcu grace period. This
is not a performance critical code path and the wait should be
accceptable. If the ref count is still inflated after the grace period,
then undo any modifications made and return an error.

Currently prep_compound_gigantic_page is type void and does not return
errors. Modify the two callers to check for and handle error returns. On
error, the caller must free the 'group' of pages as they can not be used
to form a gigantic page. After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once. Boot time
allocations can not be retried.

The routine prep_compound_page also unconditionally sets the ref count of
compound page tail pages to zero. However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages.
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion. Just
remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Youquan Song <youquan.song@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 203c06ee 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes

Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.

Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
use" mode and being managed via the buddy just like the normal RAM.

The PMEM zones are big ones:

present 65011712 = 248 G
high 134595 = 525 M

The PMEM nodes, of course, don't have any CPUs in them.

With your series, the pcp->high value per-cpu is 69584 pages or about
270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
worst-case memory in the pcps per zone, or roughly 10% of the size of
the zone.

This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node. It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.

Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Tang, Feng" <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44042b44 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: allow high-order pages to be stored on the per-cpu lists

The per-cpu page allocator (PCP) only stores order-0 pages. This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock. This patch extends the PCP allocator to store THP and
"cheap" high-order pages. Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.

Note that this is not necessarily a universal performance win because of
how it is implemented. High-order pages can cause pcp->high to be
exceeded prematurely for lower-orders so for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.

That said, basic performance testing passed. The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.

netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*

Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations. The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.

Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43b02ba9 28-Jun-2021 Mike Rapoport <rppt@kernel.org>

mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9ee6cf5 28-Jun-2021 Mike Rapoport <rppt@kernel.org>

mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

[rppt@linux.ibm.com: fix arm boot crash]
Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb1c50d3 28-Jun-2021 Mike Rapoport <rppt@kernel.org>

mm: remove CONFIG_DISCONTIGMEM

There are no architectures that support DISCONTIGMEM left.

Remove the configuration option and the dead code it was guarding in the
generic memory management code.

Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21d02f8f 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: move free_the_page

Patch series "Allow high order pages to be stored on PCP", v2.

The per-cpu page allocator (PCP) only handles order-0 pages. With the
series "Use local_lock for pcp protection and reduce stat overhead" and
"Calculate pcp->high based on zone sizes and active CPUs", it's now
feasible to store high-order pages on PCP lists.

This small series allows PCP to store "cheap" orders where cheap is
determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.

This patch (of 2):

In the next page, free_compount_page is going to use the common helper
free_the_page. This patch moves the definition to ease review. No
functional change.

Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7ec1044 28-Jun-2021 Liu Shixin <liushixin2@huawei.com>

mm/page_alloc: fix counting of managed_pages

commit f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if
the zone is empty") clears out zone->lowmem_reserve[] if zone is empty.
But when zone is not empty and sysctl_lowmem_reserve_ratio[i] is set to
zero, zone_managed_pages(zone) is not counted in the managed_pages either.
This is inconsistent with the description of lowmem_reserve, so fix it.

Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com
Fixes: f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e47aa905 28-Jun-2021 Dong Aisheng <aisheng.dong@nxp.com>

mm/page_alloc: improve memmap_pages dbg msg

Make debug message more accurate.

Link: https://lkml.kernel.org/r/20210531091908.1738465-6-aisheng.dong@nxp.com
Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74f44822 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: introduce vm.percpu_pagelist_high_fraction

This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63

[mgorman@techsingularity.net: fix documentation]
Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c49c2c47 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: limit the number of pages on PCP lists when reclaim is active

When kswapd is active then direct reclaim is potentially active. In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists. Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.

Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b12e7e9 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: scale the number of pages that are batch freed

When a task is freeing a large number of order-0 pages, it may acquire the
zone->lock multiple times freeing pages in batches. This may
unnecessarily contend on the zone lock when freeing very large number of
pages. This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent frees.

As the machines I used were not large enough to test this are not large
enough to illustrate a problem, a debugging patch shows patterns like the
following (slightly editted for clarity)

Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378

With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814

Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04f8cfea 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: adjust pcp->high after CPU hotplug events

The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug. At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct value.
After this patch is applied, the high watermarks are adjusted correctly.

# grep high: /proc/zoneinfo | tail -1
high: 649
# echo 0 > /sys/devices/system/cpu/cpu4/online
# grep high: /proc/zoneinfo | tail -1
high: 664
# echo 1 > /sys/devices/system/cpu/cpu4/online
# grep high: /proc/zoneinfo | tail -1
high: 649

Link: https://lkml.kernel.org/r/20210525080119.5455-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b92ca18e 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: disassociate the pcp->high from pcp->batch

The pcp high watermark is based on the batch size but there is no
relationship between them other than it is convenient to use early in
boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies. The
intent behind the default pcp->high is "set the number of PCP pages such
that if they are all full that background reclaim is not started
prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events which is handled later in the
series.

On a test KVM instance;

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 378
batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63

[mgorman@techsingularity.net: fix __setup_per_zone_wmarks for parallel memory
hotplug]
Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bbbecb35 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: delete vm.percpu_pagelist_fraction

Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone. With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static. It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem. Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit. A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies. Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally). To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem. While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 151e084a 28-Jun-2021 Minchan Kim <minchan@kernel.org>

mm: page_alloc: dump migrate-failed pages only at -EBUSY

alloc_contig_dump_pages() aims for helping debugging page migration
failure by elevated page refcount compared to expected_count. (for the
detail, please look at migrate_page_move_mapping)

However, -ENOMEM is just the case that system is under memory pressure
state, not relevant with page refcount at all. Thus, the dumping page
list is not helpful for the debugging point of view.

Link: https://lkml.kernel.org/r/YKa2Wyo9xqIErpfa@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90249993 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: update PGFREE outside the zone lock in __free_pages_ok

VM events do not need explicit protection by disabling IRQs so update the
counter with IRQs enabled in __free_pages_ok.

Link: https://lkml.kernel.org/r/20210512095458.30632-10-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df1acc85 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: avoid conflating IRQs disabled with zone->lock

Historically when freeing pages, free_one_page() assumed that callers had
IRQs disabled and the zone->lock could be acquired with spin_lock(). This
confuses the scope of what local_lock_irq is protecting and what
zone->lock is protecting in free_unref_page_list in particular.

This patch uses spin_lock_irqsave() for the zone->lock in free_one_page()
instead of relying on callers to have disabled IRQs.
free_unref_page_commit() is changed to only deal with PCP pages protected
by the local lock. free_unref_page_list() then first frees isolated pages
to the buddy lists with free_one_page() and frees the rest of the pages to
the PCP via free_unref_page_commit(). The end result is that
free_one_page() is no longer depending on side-effects of local_lock to be
correct.

Note that this may incur a performance penalty while memory hot-remove is
running but that is not a common operation.

[lkp@intel.com: Ensure CMA pages get addded to correct pcp list]

Link: https://lkml.kernel.org/r/20210512095458.30632-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56f0e661 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: explicitly acquire the zone lock in __free_pages_ok

__free_pages_ok() disables IRQs before calling a common helper
free_one_page() that acquires the zone lock. This is not safe according
to Documentation/locking/locktypes.rst and in this context, IRQ disabling
is not protecting a per_cpu_pages structure either or a local_lock would
be used.

This patch explicitly acquires the lock with spin_lock_irqsave instead of
relying on a helper. This removes the last instance of local_irq_save()
in page_alloc.c.

Link: https://lkml.kernel.org/r/20210512095458.30632-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43c95bcc 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: reduce duration that IRQs are disabled for VM counters

IRQs are left disabled for the zone and node VM event counters. This is
unnecessary as the affected counters are allowed to race for preemmption
and IRQs.

This patch reduces the scope of IRQs being disabled via
local_[lock|unlock]_irq on !PREEMPT_RT kernels. One
__mod_zone_freepage_state is still called with IRQs disabled. While this
could be moved out, it's not free on all architectures as some require
IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.

Link: https://lkml.kernel.org/r/20210512095458.30632-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e23060b 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: batch the accounting updates in the bulk allocator

Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be batch
updated without adding too much complexity with protected RMW updates or
using xchg.

Link: https://lkml.kernel.org/r/20210512095458.30632-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f19298b9 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/vmstat: convert NUMA statistics to basic NUMA counters

NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness. The counters are used by userspace to get a general overview
of a workloads NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events. There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
The counters are no longer updated from vmstat_refresh context as it is
unnecessary overhead for counters that may never be read by userspace.
Note that counters could be maintained at the node level to save space but
it would have a user-visible impact due to /proc/zoneinfo.

[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]

Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dbbee9d5 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: convert per-cpu list protection to local_lock

There is a lack of clarity of what exactly
local_irq_save/local_irq_restore protects in page_alloc.c . It conflates
the protection of per-cpu page allocation structures with per-cpu vmstat
deltas.

This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be but this is decreased later.

It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list.

[akpm@linux-foundation.org: coding style fixes]
[mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
[lkp@intel.com: Make pagesets static]

Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28f836b6 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: split per cpu page lists and zone stats

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.

This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

local_irq_disable();
spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a
bonus, local_lock also means that PROVE_LOCKING does something useful.

After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.

The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.

No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Baseline Patched
1 56.383 54.225 (+3.83%)
2 40.047 35.492 (+11.38%)
3 37.339 32.643 (+12.58%)
4 35.578 30.992 (+12.89%)
8 33.592 29.606 (+11.87%)
16 32.362 28.532 (+11.85%)
32 31.476 27.728 (+11.91%)
64 30.633 27.252 (+11.04%)
128 30.596 27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

This patch (of 9):

The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cache
conflict with adjacent per-cpu lists incurring a runtime cost or padding
is required incurring a memory cost.

This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures. It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).

Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.

[mgorman@techsingularity.net: make it W=1 cleaner]
Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
[mgorman@techsingularity.net: make it W=1 even cleaner]
Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
[lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly]

Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9660ecaa 28-Jun-2021 Heiner Kallweit <hkallweit1@gmail.com>

mm/page_alloc: switch to pr_debug

Having such debug messages in the dmesg log may confuse users. Therefore
restrict debug output to cases where DEBUG is defined or dynamic debugging
is enabled for the respective code piece.

Link: https://lkml.kernel.org/r/976adb93-3041-ce63-48fc-55a6096a51c1@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca891f41 28-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: constify get_pfnblock_flags_mask and get_pfnblock_migratetype

The struct page is not modified by these routines, so it can be marked
const.

Link: https://lkml.kernel.org/r/20210416231531.2521383-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 691d9497 28-Jun-2021 Aaron Tomlin <atomlin@redhat.com>

mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt

A customer experienced a low-memory situation and decided to issue a
SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
would expect, the aforementioned task remained unresponsive.

Further investigation indicated that the task was "stuck" in the
reclaim/compaction retry loop. Now, it does not make sense to retry
compaction when a fatal signal is pending.

In the context of try_to_compact_pages(), indeed COMPACT_SKIPPED can be
returned; albeit, not every zone, on the zone list, would be considered in
the case a fatal signal is found to be pending. Yet, in
should_compact_retry(), given the last known compaction result, each zone,
on the zone list, can be considered/or checked (see
compaction_zonelist_suitable()). For example, if a zone was found to
succeed, then reclaim/compaction would be tried again (notwithstanding the
above).

This patch ensures that compaction is not needlessly retried irrespective
of the last known compaction result e.g. if it was skipped, in the
unlikely case a fatal signal is found pending. So, OOM is at least
attempted.

Link: https://lkml.kernel.org/r/20210520142901.3371299-1-atomlin@redhat.com
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d2f07ec0 28-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: make __dump_page static

Patch series "Constify struct page arguments".

While working on various solutions to the 32-bit struct page size
regression, one of the problems I found was the networking stack expects
to be able to pass const struct page pointers around, and the mm doesn't
provide a lot of const-friendly functions to call. The root tangle of
problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls
dump_page(), which calls a lot of functions which don't take a const
struct page (but could be const).

This patch (of 6):

The only caller of __dump_page() now opencodes dump_page(), so remove it
as an externally visible symbol.

Link: https://lkml.kernel.org/r/20210416231531.2521383-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20210416231531.2521383-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff4b2b40 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: correct return value of populated elements if bulk array is populated

Dave Jones reported the following

This made it into 5.13 final, and completely breaks NFSD for me
(Serving tcp v3 mounts). Existing mounts on clients hang, as do
new mounts from new clients. Rebooting the server back to rc7
everything recovers.

The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.

Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net
Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 122e093c 28-Jun-2021 Mike Rapoport <rppt@kernel.org>

mm/page_alloc: fix memory map initialization for descending nodes

On systems with memory nodes sorted in descending order, for instance Dell
Precision WorkStation T5500, the struct pages for higher PFNs and
respectively lower nodes, could be overwritten by the initialization of
struct pages corresponding to the holes in the memory sections.

For example for the below memory layout

[ 0.245624] Early memory node ranges
[ 0.248496] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.251376] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.254256] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.257144] node 0: [mem 0x0000001424000000-0x0000002023ffffff]

the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
the middle of a section and will be considered as a hole during the
initialization of the last section in node 1.

The wrong initialization of the memory map causes panic on boot when
CONFIG_DEBUG_VM is enabled.

Reorder loop order of the memory map initialization so that the outer loop
will always iterate over populated memory regions in the ascending order
and the inner loop will select the zone corresponding to the PFN range.

This way initialization of the struct pages for the memory holes will be
always done for the ranges that are actually not populated.

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Boris Petkov <bp@alien8.de>
Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 66d92825 28-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: Correct return value of populated elements if bulk array is populated

Dave Jones reported the following

This made it into 5.13 final, and completely breaks NFSD for me
(Serving tcp v3 mounts). Existing mounts on clients hang, as do
new mounts from new clients. Rebooting the server back to rc7
everything recovers.

The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.

Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c275c5c6 02-Jun-2021 Peter Collingbourne <pcc@google.com>

kasan: disable freed user page poisoning with HW tags

Poisoning freed pages protects against kernel use-after-free. The
likelihood of such a bug involving kernel pages is significantly higher
than that for user pages. At the same time, poisoning freed pages can
impose a significant performance cost, which cannot always be justified
for user pages given the lower probability of finding a bug. Therefore,
disable freed user page poisoning when using HW tags. We identify
"user" pages via the flag set GFP_HIGHUSER_MOVABLE, which indicates
a strong likelihood of not being directly accessible to the kernel.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://linux-review.googlesource.com/id/I716846e2de8ef179f44e835770df7e6307be96c9
Link: https://lore.kernel.org/r/20210602235230.3928842-5-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>


# 013bb59d 02-Jun-2021 Peter Collingbourne <pcc@google.com>

arm64: mte: handle tags zeroing at page allocation time

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags. It is, however, more
efficient to clear the tags at the same time as zeroing the data on
allocation. To avoid clearing the tags on any page (which may not be
mapped as tagged), only do this if the vma flags contain VM_MTE. This
requires introducing a new GFP flag that is used to determine whether
to clear the tags.

The DC GZVA instruction with a 0 top byte (and 0 tag) requires
top-byte-ignore. Set the TCR_EL1.{TBI1,TBID1} bits irrespective of
whether KASAN_HW is enabled.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://linux-review.googlesource.com/id/Id46dc94e30fe11474f7e54f5d65e7658dbdddb26
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20210602235230.3928842-4-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>


# 7a3b8353 02-Jun-2021 Peter Collingbourne <pcc@google.com>

kasan: use separate (un)poison implementation for integrated init

Currently with integrated init page_alloc.c needs to know whether
kasan_alloc_pages() will zero initialize memory, but this will start
becoming more complicated once we start adding tag initialization
support for user pages. To avoid page_alloc.c needing to know more
details of what integrated init will do, move the unpoisoning logic
for integrated init into the HW tags implementation. Currently the
logic is identical but it will diverge in subsequent patches.

For symmetry do the same for poisoning although this logic will
be unaffected by subsequent patches.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://linux-review.googlesource.com/id/I2c550234c6c4a893c48c18ff0c6ce658c7c67056
Link: https://lore.kernel.org/r/20210602235230.3928842-3-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>


# b3b64ebd 24-Jun-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: do bulk array bounds check after checking populated elements

Dan Carpenter reported the following

The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
to the bulk page allocator" from Apr 29, 2021, leads to the following
static checker warning:

mm/page_alloc.c:5338 __alloc_pages_bulk()
warn: potentially one past the end of array 'page_array[nr_populated]'

The problem can occur if an array is passed in that is fully populated.
That potentially ends up allocating a single page and storing it past
the end of the array. This patch returns 0 if the array is fully
populated.

Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Mel Gorman <mgorman@techsinguliarity.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b08e50dd 24-Jun-2021 Rasmus Villemoes <linux@rasmusvillemoes.dk>

mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array

In the event that somebody would call this with an already fully
populated page_array, the last loop iteration would do an access beyond
the end of page_array.

It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer. Also, if it really is not
supposed to be invoked this way (i.e., with no NULL entries in
page_array), the nr_populated<nr_pages check could simply be removed
instead.

Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bac9c6fa 04-Jun-2021 Ding Hui <dinghui@sangfor.com.cn>

mm/page_alloc: fix counting of free pages after take off from buddy

Recently we found that there is a lot MemFree left in /proc/meminfo
after do a lot of pages soft offline, it's not quite correct.

Before Oscar's rework of soft offline for free pages [1], if we soft
offline free pages, these pages are left in buddy with HWPoison flag,
and NR_FREE_PAGES is not updated immediately. So the difference between
NR_FREE_PAGES and real number of available free pages is also even big
at the beginning.

However, with the workload running, when we catch HWPoison page in any
alloc functions subsequently, we will remove it from buddy, meanwhile
update the NR_FREE_PAGES and try again, so the NR_FREE_PAGES will get
more and more closer to the real number of available free pages.
(regardless of unpoison_memory())

Now, for offline free pages, after a successful call
take_page_off_buddy(), the page is no longer belong to buddy allocator,
and will not be used any more, but we missed accounting NR_FREE_PAGES in
this situation, and there is no chance to be updated later.

Do update in take_page_off_buddy() like rmqueue() does, but avoid double
counting if some one already set_migratetype_isolate() on the page.

[1]: commit 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")

Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn
Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# baf2f90b 06-May-2021 Lu Jialin <lujialin4@huawei.com>

mm: fix typos in comments

succed -> succeed in mm/hugetlb.c
wil -> will in mm/mempolicy.c
wit -> with in mm/page_alloc.c
Retruns -> Returns in mm/page_vma_mapped.c
confict -> conflict in mm/secretmem.c
No functionality changed.

Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0953a1b 06-May-2021 Ingo Molnar <mingo@kernel.org>

mm: fix typos in comments

Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68d68ff6 04-May-2021 Zhiyuan Dai <daizhiyuan@phytium.com.cn>

mm/mempool: minor coding style tweaks

Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8ca55913 04-May-2021 Mel Gorman <mgorman@techsingularity.net>

mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove

zone_pcp_reset allegedly protects against a race with drain_pages using
local_irq_save but this is bogus. local_irq_save only operates on the
local CPU. If memory hotplug is running on CPU A and drain_pages is
running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch deletes IRQ disable/enable on the grounds that IRQs protect
nothing and assumes the existing hotplug paths guarantees the PCP cannot
be used after zone_pcp_enable(). That should be the case already
because all the pages have been freed and there is no page to put on the
PCP lists.

Link: https://lkml.kernel.org/r/20210412090346.GQ3697@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e3560d9 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm: honor PF_MEMALLOC_PIN for all movable pages

PF_MEMALLOC_PIN is only honored for CMA pages, extend this flag to work
for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from
gfp_mask when this flag is passed in the current context.

Add is_pinnable_page() to return true if page is in a pinnable page. A
pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.

Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da6df1b0 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm: apply per-task gfp constraints in fast path

Function current_gfp_context() is called after fast path. However, soon
we will add more constraints which will also limit zones based on
context. Move this call into fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on value returned by
current_gfp_context() because it soon will modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during fast path,
but we already load current->flags in fast-path:

__alloc_pages()
prepare_alloc_pages()
current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constrain logic to current_gfp_context() we
will be able to remove current->flags load from current_alloc_flags, and
therefore return fast-path to the current performance level.

Link: https://lkml.kernel.org/r/20210215161349.246722-7-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a08ae36 04-May-2021 Pavel Tatashin <pasha.tatashin@soleen.com>

mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not
return pages that might belong to CMA region. This is currently used
for long term gup to make sure that such pins are not going to be done
on any CMA pages.

When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it
is focusing on CMA pages too much and that there is larger class of
pages that need the same treatment. MOVABLE zone cannot contain any
long term pins as well so it makes sense to reuse and redefine this flag
for that usecase as well. Rename the flag to PF_MEMALLOC_PIN which
defines an allocation context which can only get pages suitable for
long-term pins.

Also rename: memalloc_nocma_save()/memalloc_nocma_restore to
memalloc_pin_save()/memalloc_pin_restore() and make the new functions
common.

[rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org

Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78fa5150 04-May-2021 Minchan Kim <minchan@kernel.org>

mm: use proper type for cma_[alloc|release]

size_t in cma_alloc is confusing since it makes people think it's byte
count, not pages. Change it to unsigned long[1].

The unsigned int in cma_release is also not right so change it. Since we
have unsigned long in cma_release, free_contig_range should also respect
it.

[1] 67a2e213e7e9, mm: cma: fix incorrect type conversion for size during dma allocation

Link: https://lore.kernel.org/linux-mm/20210324043434.GP1719932@casper.infradead.org/
Link: https://lkml.kernel.org/r/20210331164018.710560-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 361a2a22 04-May-2021 Minchan Kim <minchan@kernel.org>

mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]

Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable]. There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

note: migrate_prep_local in compaction.c changed into lru_add_drain to
avoid CPU schedule cost with involving many other CPUs to keep old
behavior.

Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d479960e 04-May-2021 Minchan Kim <minchan@kernel.org>

mm: disable LRU pagevec during the migration temporarily

LRU pagevec holds refcount of pages until the pagevec are drained. It
could prevent migration since the refcount of the page is greater than
the expection in migration logic. To mitigate the issue, callers of
migrate_pages drains LRU pagevec via migrate_prep or lru_add_drain_all
before migrate_pages call.

However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.

To close the race, this patch disables lru caches(i.e, pagevec) during
ongoing migration until migrate is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a sync
migration) with below debug code.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR, "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}

The test was repeating android apps launching with cma allocation in
background every five seconds. Total cma allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.

The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.

Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06dac2f4 04-May-2021 Charan Teja Reddy <charante@codeaurora.org>

mm: compaction: update the COMPACT[STALL|FAIL] events properly

By definition, COMPACT[STALL|FAIL] events needs to be counted when there
is 'At least in one zone compaction wasn't deferred or skipped from the
direct compaction'. And when compaction is skipped or deferred,
COMPACT_SKIPPED will be returned but it will still go and update these
compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
is counted without even trying the compaction.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
the unnecessary try into the get_page_from_freelist() when compaction is
not even tried.

There is a corner case where compaction is skipped but still count
COMPACTSTALL event, which is that IRQ came and freed the page and the
same is captured in capture_control.

Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 202e35db 04-May-2021 Dave Hansen <dave.hansen@linux.intel.com>

mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

RECLAIM_ZONE was assumed to be unused because it was never explicitly
used in the kernel. However, there were a number of places where it was
checked implicitly by checking 'node_reclaim_mode' for a zero value.

These zero checks are not great because it is not obvious what a zero
mode *means* in the code. Replace them with a helper which makes it
more obvious: node_reclaim_enabled().

This helper also provides a handy place to explicitly check the
RECLAIM_ZONE bit itself. Check it explicitly there to make it more
obvious where the bit can affect behavior.

This should have no functional impact.

Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb14d4ee 04-May-2021 Oscar Salvador <osalvador@suse.de>

mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig

pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things. We can drop the in-use page check since
__alloc_contig_pages can migrate away those pages, and the hugetlb page
check can go too since isolate_migratepages_range is now capable of
dealing with hugetlb pages. Either way, those checks are racy so let the
end function handle it when the time comes.

Link: https://lkml.kernel.org/r/20210419075413.1064-8-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2ad7a1f 04-May-2021 Oscar Salvador <osalvador@suse.de>

mm,compaction: let isolate_migratepages_{range,block} return error codes

Currently, isolate_migratepages_{range,block} and their callers use a pfn
== 0 vs pfn != 0 scheme to let the caller know whether there was any error
during isolation.

This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes. Since isolate_migratepages_block will stop returning the next pfn
to be scanned, we reuse the cc->migrate_pfn field to keep track of that.

Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8e28b47 04-May-2021 Oscar Salvador <osalvador@suse.de>

mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range

Patch series "Make alloc_contig_range handle Hugetlb pages", v10.

alloc_contig_range lacks the ability to handle HugeTLB pages. This can
be problematic for some users, e.g: CMA and virtio-mem, where those
users will fail the call if alloc_contig_range ever sees a HugeTLB page,
even when those pages lay in ZONE_MOVABLE and are free. That problem
can be easily solved by replacing the page in the free hugepage pool.

In-use HugeTLB are no exception though, as those can be isolated and
migrated as any other LRU or Movable page.

This aims to improve alloc_contig_range->isolate_migratepages_block, so
that HugeTLB pages can be recognized and handled.

Since we also need to start reporting errors down the chain (e.g:
-ENOMEM due to not be able to allocate a new hugetlb page),
isolate_migratepages_{range,block} interfaces need to change to start
reporting error codes instead of the pfn == 0 vs pfn != 0 scheme it is
using right now. From now on, isolate_migratepages_block will not
return the next pfn to be scanned anymore, but -EINTR, -ENOMEM or 0, so
we the next pfn to be scanned will be recorded in cc->migrate_pfn field
(as it is already done in isolate_migratepages_range()).

Below is an insight from David (thanks), where the problem can clearly be
seen:

"Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
ZONE_MOVABLE. Allocate 512 huge pages.

[root@localhost ~]# cat /proc/meminfo
MemTotal: 5061512 kB
MemFree: 3319396 kB
MemAvailable: 3457144 kB
...
HugePages_Total: 512
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

The huge pages get partially allocate from ZONE_MOVABLE. Try unplugging
1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

[ 180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

And then with this patchset running:

"Same experiment with ZONE_MOVABLE:

a) Free huge pages: all memory can get unplugged again.

b) Allocated/populated but idle huge pages: all memory can get unplugged
again.

c) Allocated/populated but all 512 huge pages are read/written in a
loop: all memory can get unplugged again, but I get a single

[ 121.192345] alloc_contig_range: [180000, 188000) PFNs busy

Most probably because it happened to try migrating a huge page
while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of
times, it can deal with this temporary failure.

Last but not least, I did something extreme:

# cat /proc/meminfo
MemTotal: 5061568 kB
MemFree: 186560 kB
MemAvailable: 354524 kB
...
HugePages_Total: 2048
HugePages_Free: 2048
HugePages_Rsvd: 0
HugePages_Surp: 0

Triggering unplug would require to dissolve+alloc - which now fails
when trying to allocate an additional ~512 huge pages (1G).

As expected, I can properly see memory unplug not fully succeeding. +
I get a fairly continuous stream of

[ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
...

But more importantly, the hugepage count remains stable, as configured
by the admin (me):

HugePages_Total: 2048
HugePages_Free: 2048
HugePages_Rsvd: 0
HugePages_Surp: 0"

This patch (of 7):

Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or
-EBUSY, and report them down the chain. The problem is that when
migrate_pages() reports -ENOMEM, we keep going till we exhaust all the
try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a
fatal error. Do the same here instead of keep going and retrying. Note
that this is not fixing a real issue, just a cosmetic change. Although we
can save some cycles by backing off ealier

Link: https://lkml.kernel.org/r/20210419075413.1064-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20210419075413.1064-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9df65f52 30-Apr-2021 Sergei Trofimovich <slyfox@gentoo.org>

mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1

On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1 implies
page_poison=on:

if (page_poisoning_enabled() ||
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()))
static_branch_enable(&_page_poisoning_enabled);

page_poison=on needs to override init_on_free=1.

Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1

That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);

which leads to poisoned pages returned for __GFP_ZERO pages.

After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
and ignore init_on_free=1.

Link: https://lkml.kernel.org/r/20210329222555.3077928-1-slyfox@gentoo.org
Link: https://lkml.org/lkml/2021/3/26/443
Fixes: 8db26a3d4735 ("mm, page_poison: use static key more efficiently")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b822017 30-Apr-2021 Jesper Dangaard Brouer <brouer@redhat.com>

mm/page_alloc: inline __rmqueue_pcplist

When __alloc_pages_bulk() got introduced two callers of __rmqueue_pcplist
exist and the compiler chooses to not inline this function.

./scripts/bloat-o-meter vmlinux-before vmlinux-inline__rmqueue_pcplist
add/remove: 0/1 grow/shrink: 2/0 up/down: 164/-125 (39)
Function old new delta
rmqueue 2197 2296 +99
__alloc_pages_bulk 1921 1986 +65
__rmqueue_pcplist 125 - -125
Total: Before=19374127, After=19374166, chg +0.00%

modprobe page_bench04_bulk loops=$((10**7))

Type:time_bulk_page_alloc_free_array
- Per elem: 106 cycles(tsc) 29.595 ns (step:64)
- (measurement period time:0.295955434 sec time_interval:295955434)
- (invoke count:10000000 tsc_interval:1065447105)

Before:
- Per elem: 110 cycles(tsc) 30.633 ns (step:64)

Link: https://lkml.kernel.org/r/20210325114228.27719-6-mgorman@techsingularity.net
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce76f9a1 30-Apr-2021 Jesper Dangaard Brouer <brouer@redhat.com>

mm/page_alloc: optimize code layout for __alloc_pages_bulk

Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear
that the code activated is suboptimal. The compiler guesses wrong and
places unlikely code at the beginning. Due to the use of WARN_ON_ONCE()
macro the UD2 asm instruction is added to the code, which confuse the
I-cache prefetcher in the CPU.

[mgorman@techsingularity.net: minor changes and rebasing]

Link: https://lkml.kernel.org/r/20210325114228.27719-5-mgorman@techsingularity.net
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-By: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f87d9d3 30-Apr-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: add an array-based interface to the bulk page allocator

The proposed callers for the bulk allocator store pages from the bulk
allocator in an array. This patch adds an array-based interface to the
API to avoid multiple list iterations. The page list interface is
preserved to avoid requiring all users of the bulk API to allocate and
manage enough storage to store the pages.

[akpm@linux-foundation.org: remove now unused local `allocated']

Link: https://lkml.kernel.org/r/20210325114228.27719-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 387ba26f 30-Apr-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: add a bulk page allocator

This patch adds a new page allocator interface via alloc_pages_bulk, and
__alloc_pages_bulk_nodemask. A caller requests a number of pages to be
allocated and added to a list.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch if
necessary.

Note that this implementation is not very efficient and could be
improved but it would require refactoring. The intent is to make it
available early to determine what semantics are required by different
callers. Once the full semantics are nailed down, it can be refactored.

[mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew]
Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net
[mgorman@techsingularity.net: fix uninit var warning]
Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net
[mgorman@techsingularity.net: fix comment, per Vlastimil]
Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net

Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Tested-by: Colin Ian King <colin.king@canonical.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb66bede 30-Apr-2021 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc: rename alloced to allocated

Patch series "Introduce a bulk order-0 page allocator with two in-tree users", v6.

This series introduces a bulk order-0 page allocator with sunrpc and the
network page pool being the first users. The implementation is not
efficient as semantics needed to be ironed out first. If no other
semantic changes are needed, it can be made more efficient. Despite that,
this is a performance-related for users that require multiple pages for an
operation without multiple round-trips to the page allocator. Quoting the
last patch for the high-speed networking use-case

Kernel XDP stats CPU pps Delta
Baseline XDP-RX CPU total 3,771,046 n/a
List XDP-RX CPU total 3,940,242 +4.49%
Array XDP-RX CPU total 4,249,224 +12.68%

Via the SUNRPC traces of svc_alloc_arg()

Single page: 25.007 us per call over 532,571 calls
Bulk list: 6.258 us per call over 517,034 calls
Bulk array: 4.590 us per call over 517,442 calls

Both potential users in this series are corner cases (NFS and high-speed
networks) so it is unlikely that most users will see any benefit in the
short term. Other potential other users are batch allocations for page
cache readahead, fault around and SLUB allocations when high-order pages
are unavailable. It's unknown how much benefit would be seen by
converting multiple page allocation calls to a single batch or what
difference it may make to headline performance.

Light testing of my own running dbench over NFS passed. Chuck and Jesper
conducted their own tests and details are included in the changelogs.

Patch 1 renames a variable name that is particularly unpopular

Patch 2 adds a bulk page allocator

Patch 3 adds an array-based version of the bulk allocator

Patches 4-5 adds micro-optimisations to the implementation

Patches 6-7 SUNRPC user

Patches 8-9 Network page_pool user

This patch (of 9):

Review feedback of the bulk allocator twice found problems with "alloced"
being a counter for pages allocated. The naming was based on the API name
"alloc" and was based on the idea that verbal communication about malloc
tends to use the fake word "malloced" instead of the fake word mallocated.
To be consistent, this preparation patch renames alloced to allocated in
rmqueue_bulk so the bulk allocator and per-cpu allocator use similar names
when the bulk allocator is introduced.

Link: https://lkml.kernel.org/r/20210325114228.27719-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210325114228.27719-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f709dbd 30-Apr-2021 zhouchuangao <zhouchuangao@vivo.com>

mm/page_alloc: duplicate include linux/vmalloc.h

linux/vmalloc.h is repeatedly in the file page_alloc.c

Link: https://lkml.kernel.org/r/1616468751-80656-1-git-send-email-zhouchuangao@vivo.com
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 39ddb991 30-Apr-2021 Kefeng Wang <wangkefeng.wang@huawei.com>

mm, page_alloc: avoid page_to_pfn() in move_freepages()

The start_pfn and end_pfn are already available in move_freepages_block(),
there is no need to go back and forth between page and pfn in
move_freepages and move_freepages_block, and pfn_valid_within() should
validate pfn first before touching the page.

Link: https://lkml.kernel.org/r/20210323131215.934472-1-liushixin2@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1394bdd 30-Apr-2021 Minchan Kim <minchan@kernel.org>

mm: page_alloc: dump migrate-failed pages

Currently, debugging CMA allocation failures is quite limited. The most
common source of these failures seems to be page migration which doesn't
provide any useful information on the reason of the failure by itself.
alloc_contig_range can report those failures as it holds a list of
migrate-failed pages.

The information logged by dump_page() has already proven helpful for
debugging allocation issues, like identifying long-term pinnings on
ZONE_MOVABLE or MIGRATE_CMA.

Let's use the dynamic debugging infrastructure, such that we avoid
flooding the logs and creating a lot of noise on frequent
alloc_contig_range() calls. This information is helpful for debugging
only.

There are two ifdefery conditions to support common dyndbg options:

- CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE
It aims for supporting the feature with only specific file with
adding ccflags.

- CONFIG_DYNAMIC_DEBUG
It aims for supporting the feature with system wide globally.

A simple example to enable the feature:

Admin could enable the dump like this(by default, disabled)

echo "func alloc_contig_dump_pages +p" > control

Admin could disable it.

echo "func alloc_contig_dump_pages =_" > control

Detail goes Documentation/admin-guide/dynamic-debug-howto.rst

A concern is utility functions in dump_page use inconsistent
loglevels. In the future, we might want to make the loglevels
used inside dump_page() consistent and eventually rework the way
we log the information here. See [1].

[1] https://lore.kernel.org/linux-mm/YEh4doXvyuRl5BDB@google.com/

Link: https://lkml.kernel.org/r/20210311194042.825152-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84172f4b 30-Apr-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask

There are only two callers of __alloc_pages() so prune the thicket of
alloc_page variants by combining the two functions together. Current
callers of __alloc_pages() simply add an extra 'NULL' parameter and
current callers of __alloc_pages_nodemask() call __alloc_pages() instead.

Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e5e0f28 30-Apr-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: rename gfp_mask to gfp

Shorten some overly-long lines by renaming this identifier.

Link: https://lkml.kernel.org/r/20210225150642.2582252-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e6a930b 30-Apr-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: rename alloc_mask to alloc_gfp

Patch series "Rationalise __alloc_pages wrappers", v3.

I was poking around the __alloc_pages variants trying to understand why
they each exist, and couldn't really find a good justification for keeping
__alloc_pages and __alloc_pages_nodemask as separate functions. That led
to getting rid of alloc_pages_current() and then I noticed the
documentation was bad, and then I noticed the mempolicy documentation
wasn't included.

Anyway, this is all cleanups & doc fixes.

This patch (of 7):

We have two masks involved -- the nodemask and the gfp mask, so alloc_mask
is an unclear name.

Link: https://lkml.kernel.org/r/20210225150642.2582252-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cef4c7d2 30-Apr-2021 Minchan Kim <minchan@kernel.org>

mm: remove lru_add_drain_all in alloc_contig_range

__alloc_contig_migrate_range already has lru_add_drain_all call via
migrate_prep. It's necessary to move LRU taget pages into LRU list to be
able to isolated. However, lru_add_drain_all call after
__alloc_contig_migrate_range is pointless since it has changed source page
freeing from putback_lru_pages to put_page[1].

This patch removes it.

[1] c6c919eb90e0, ("mm: use put_page() to free page instead of putback_lru_page()"

Link: https://lkml.kernel.org/r/20210303204512.2863087-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77febec2 30-Apr-2021 David Hildenbrand <david@redhat.com>

mm/page_alloc: drop pr_info_ratelimited() in alloc_contig_range()

The information that some PFNs are busy is:

a) not helpful for ordinary users: we don't even know *who* called
alloc_contig_range(). This is certainly not worth a pr_info.*().

b) not really helpful for debugging: we don't have any details *why*
these PFNs are busy, and that is what we usually care about.

c) not complete: there are other cases where we fail alloc_contig_range()
using different paths that are not getting recorded.

For example, we reach this path once we succeeded in isolating pageblocks,
but failed to migrate some pages - which can happen easily on ZONE_NORMAL
(i.e., has_unmovable_pages() is racy) but also on ZONE_MOVABLE i.e., we
would have to retry longer to migrate).

For example via virtio-mem when unplugging memory, we can create quite
some noise (especially with ZONE_NORMAL) that is not of interest to users
- it's expected that some allocations may fail as memory is busy.

Let's just drop that pr_info_ratelimit() and rather implement a dynamic
debugging mechanism in the future that can give us a better reason why
alloc_contig_range() failed on specific pages.

Link: https://lkml.kernel.org/r/20210301150945.77012-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f9d03c5 30-Apr-2021 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: move mem_init_print_info() into mm_init()

mem_init_print_info() is called in mem_init() on each architecture, and
pass NULL argument, so using void argument and move it into mm_init().

Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc]
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64]
Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1bb5eab3 30-Apr-2021 Andrey Konovalov <andreyknvl@gmail.com>

kasan, mm: integrate page_alloc init with HW_TAGS

This change uses the previously added memory initialization feature of
HW_TAGS KASAN routines for page_alloc memory when init_on_alloc/free is
enabled.

With this change, kernel_init_free_pages() is no longer called when both
HW_TAGS KASAN and init_on_alloc/free are enabled. Instead, memory is
initialized in KASAN runtime.

To avoid discrepancies with which memory gets initialized that can be
caused by future changes, both KASAN and kernel_init_free_pages() hooks
are put together and a warning comment is added.

This patch changes the order in which memory initialization and page
poisoning hooks are called. This doesn't lead to any side-effects, as
whenever page poisoning is enabled, memory initialization gets disabled.

Combining setting allocation tags with memory initialization improves
HW_TAGS KASAN performance when init_on_alloc/free is enabled.

[andreyknvl@google.com: fix for "integrate page_alloc init with HW_TAGS"]
Link: https://lkml.kernel.org/r/65b6028dea2e9a6e8e2cb779b5115c09457363fc.1617122211.git.andreyknvl@google.com

Link: https://lkml.kernel.org/r/e77f0d5b1b20658ef0b8288625c74c2b3690e725.1615296150.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c335680 29-Apr-2021 Andrey Konovalov <andreyknvl@gmail.com>

mm, kasan: don't poison boot memory with tag-based modes

During boot, all non-reserved memblock memory is exposed to page_alloc via
memblock_free_pages->__free_pages_core(). This results in
kasan_free_pages() being called, which poisons that memory.

Poisoning all that memory lengthens boot time. The most noticeable effect
is observed with the HW_TAGS mode. A boot-time impact may potentially
also affect systems with large amount of RAM.

This patch changes the tag-based modes to not poison the memory during the
memblock->page_alloc transition.

An exception is made for KASAN_GENERIC. Since it marks all new memory as
accessible, not poisoning the memory released from memblock will lead to
KASAN missing invalid boot-time accesses to that memory.

With KASAN_SW_TAGS, as it uses the invalid 0xFE tag as the default tag for
all memory, it won't miss bad boot-time accesses even if the poisoning of
memblock memory is removed.

With KASAN_HW_TAGS, the default memory tags values are unspecified.
Therefore, if memblock poisoning is removed, this KASAN mode will miss the
mentioned type of boot-time bugs with a 1/16 probability. This is taken
as an acceptable trafe-off.

Internally, the poisoning is removed as follows. __free_pages_core() is
used when exposing fresh memory during system boot and when onlining
memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag
and passes it to __free_pages_ok() through free_pages_prepare() from
__free_pages_core(). If FPI_SKIP_KASAN_POISON is set, kasan_free_pages()
is not called.

All memory allocated normally when the boot is over keeps getting poisoned
as usual.

Link: https://lkml.kernel.org/r/a0570dc1e3a8f39a55aa343a1fc08cd5c2d4cad6.1613692950.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 121e6f32 29-Apr-2021 Nicholas Piggin <npiggin@gmail.com>

mm/vmalloc: hugepage vmalloc mappings

Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
supports PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
larger, and fall back to small pages if that was unsuccessful.

Architectures must ensure that any arch specific vmalloc allocations that
require PAGE_SIZE mappings (e.g., module allocations vs strict module rwx)
use the VM_NOHUGE flag to inhibit larger mappings.

This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmalloc is added to disable at boot.

[colin.king@canonical.com: fix read of uninitialized pointer area]
Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com

Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48060834 29-Apr-2021 Muchun Song <songmuchun@bytedance.com>

mm: memcontrol: directly access page->memcg_data in mm/page_alloc.c

page_memcg() is not suitable for use by page_expected_state() and
page_bad_reason(). Because it can BUG_ON() for the slab pages when
CONFIG_DEBUG_VM is enabled. As neither lru, nor kmem, nor slab page
should have anything left in there by the time the page is freed, what
we care about is whether the value of page->memcg_data is 0. So just
directly access page->memcg_data here.

Link: https://lkml.kernel.org/r/20210319163821.20704-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51cba1eb 01-Apr-2021 Kees Cook <keescook@chromium.org>

init_on_alloc: Optimize static branches

The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
change the assembly ordering of the static branches: they were always out
of line. Use the new jump_label macros to check the CONFIG settings to
default to the "expected" state, which slightly optimizes the resulting
assembly code.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org


# e1baddf8 12-Mar-2021 Zhou Guanghui <zhouguanghui1@huawei.com>

mm/memcg: set memcg when splitting page

As described in the split_page() comment, for the non-compound high order
page, the sub-pages must be freed individually. If the memcg of the first
page is valid, the tail pages cannot be uncharged when be freed.

For example, when alloc_pages_exact is used to allocate 1MB continuous
physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
set). When make_alloc_exact free the unused 1MB and free_pages_exact free
the applied 1MB, actually, only 4KB(one page) is uncharged.

Therefore, the memcg of the tail page needs to be set when splitting a
page.

Michel:

There are at least two explicit users of __GFP_ACCOUNT with
alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
just a theoretical issue.

Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f9d79e8d 12-Mar-2021 Andrey Konovalov <andreyknvl@google.com>

kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC

Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.

This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.

Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0740a50b 12-Mar-2021 Mike Rapoport <rppt@kernel.org>

mm/page_alloc.c: refactor initialization of struct page for holes in memory layout

There could be struct pages that are not backed by actual physical memory.
This can happen when the actual memory bank is not a multiple of
SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized using init_unavailable_mem() function
that iterates through PFNs in holes in memblock.memory and if there is a
struct page corresponding to a PFN, the fields of this page are set to
default values and it is marked as Reserved.

init_unavailable_mem() does not take into account zone and node the page
belongs to and sets both zone and node links in struct page to zero.

Before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN") the holes inside a zone were
re-initialized during memmap_init() and got their zone/node links right.
However, after that commit nothing updates the struct pages representing
such holes.

On a system that has firmware reserved holes in a zone above ZONE_DMA, for
instance in a configuration below:

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM

unset zone link in struct page will trigger

VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

in set_pfnblock_flags_mask() when called with a struct page from a range
other than E820_TYPE_RAM because there are pages in the range of
ZONE_DMA32 but the unset zone link in struct page makes them appear as a
part of ZONE_DMA.

Interleave initialization of the unavailable pages with the normal
initialization of memory map, so that zone and node information will be
properly set on struct pages that are not backed by the actual memory.

With this change the pages for holes inside a zone will get proper
zone/node links and the pages that are not spanned by any node will get
links to the adjacent zone/node. The holes between nodes will be
prepended to the zone/node above the hole and the trailing pages in the
last section that will be appended to the zone/node below.

[akpm@linux-foundation.org: don't initialize static to zero, use %llu for u64]

Link: https://lkml.kernel.org/r/20210225224351.7356-2-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Qian Cai <cai@lca.pw>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Łukasz Majczak <lma@semihalf.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Sarvela, Tomi P" <tomi.p.sarvela@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c381db1 25-Feb-2021 David Hildenbrand <david@redhat.com>

mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo

Let's count the number of CMA pages per zone and print them in
/proc/zoneinfo.

Having access to the total number of CMA pages per zone is helpful for
debugging purposes to know where exactly the CMA pages ended up, and to
figure out how many pages of a zone might behave differently, even after
some of these pages might already have been allocated.

As one example, CMA pages part of a kernel zone cannot be used for
ordinary kernel allocations but instead behave more like ZONE_MOVABLE.

For now, we are only able to get the global nr+free cma pages from
/proc/meminfo and the free cma pages per zone from /proc/zoneinfo.

Example after this patch when booting a 6 GiB QEMU VM with
"hugetlb_cma=2G":
# cat /proc/zoneinfo | grep cma
cma 0
nr_free_cma 0
cma 0
nr_free_cma 0
cma 524288
nr_free_cma 493016
cma 0
cma 0
# cat /proc/meminfo | grep Cma
CmaTotal: 2097152 kB
CmaFree: 1972064 kB

Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
one can be sure when spotting "cma 0", that there are definetly no
CMA pages located in a zone.

[david@redhat.com: v2]
Link: https://lkml.kernel.org/r/20210128164533.18566-1-david@redhat.com
[david@redhat.com: v3]
Link: https://lkml.kernel.org/r/20210129113451.22085-1-david@redhat.com

Link: https://lkml.kernel.org/r/20210127101813.6370-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0cd7a7c 24-Feb-2021 David Hildenbrand <david@redhat.com>

mm: simplify free_highmem_page() and free_reserved_page()

adjust_managed_page_count() as called by free_reserved_page() properly
handles pages in a highmem zone, so we can reuse it for
free_highmem_page().

We can now get rid of totalhigh_pages_inc() and simplify
free_reserved_page().

Link: https://lkml.kernel.org/r/20210126182113.19892-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9699ee7b 24-Feb-2021 Baoquan He <bhe@redhat.com>

mm: remove unneeded local variable in free_area_init_core

Local variable 'zone_start_pfn' is not needed since there's only one call
site in free_area_init_core(). Let's remove it and pass
zone->zone_start_pfn directly to init_currently_empty_zone().

Link: https://lkml.kernel.org/r/20210122135956.5946-6-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7010a6ec 24-Feb-2021 Baoquan He <bhe@redhat.com>

mm: simplify parameter of setup_usemap()

Parameter 'zone' has got needed information, let's remove other
unnecessary parameters.

Link: https://lkml.kernel.org/r/20210122135956.5946-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3256ff83 24-Feb-2021 Baoquan He <bhe@redhat.com>

mm: simplify parater of function memmap_init_zone()

As David suggested, simply passing 'struct zone *zone' is enough. We can
get all needed information from 'struct zone*' easily.

Link: https://lkml.kernel.org/r/20210122135956.5946-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab28cb6e 24-Feb-2021 Baoquan He <bhe@redhat.com>

mm: rename memmap_init() and memmap_init_zone()

The current memmap_init_zone() only handles memory region inside one zone,
actually memmap_init() does the memmap init of one zone. So rename both
of them accordingly.

Link: https://lkml.kernel.org/r/20210122135956.5946-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1528e21 24-Feb-2021 Muchun Song <songmuchun@bytedance.com>

mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_PMDMAPPED account to pages. This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival"). Doing this also can make the unit of vmstat counters more
unified. Finally, the unit of the vmstat counters are pages, kB and
bytes. The B/KB suffix can tell us that the unit is bytes or kB. The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57b2847d 24-Feb-2021 Muchun Song <songmuchun@bytedance.com>

mm: memcontrol: convert NR_SHMEM_THPS account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_THPS account to pages. This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival"). Doing this also can make the unit of vmstat counters more
unified. Finally, the unit of the vmstat counters are pages, kB and
bytes. The B/KB suffix can tell us that the unit is bytes or kB. The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69473e5d 24-Feb-2021 Muchun Song <songmuchun@bytedance.com>

mm: memcontrol: convert NR_ANON_THPS account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_ANON_THPS account to pages. This patch is consistent
with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also can make the unit of vmstat counters more unified.
Finally, the unit of the vmstat counters are pages, kB and bytes. The
B/KB suffix can tell us that the unit is bytes or kB. The rest which is
without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b358e212 04-Feb-2021 Kevin Hao <haokexin@gmail.com>

mm: page_frag: Introduce page_frag_alloc_align()

In the current implementation of page_frag_alloc(), it doesn't have
any align guarantee for the returned buffer address. But for some
hardwares they do require the DMA buffer to be aligned correctly,
so we would have to use some workarounds like below if the buffers
allocated by the page_frag_alloc() are used by these hardwares for
DMA.
buf = page_frag_alloc(really_needed_size + align);
buf = PTR_ALIGN(buf, align);

These codes seems ugly and would waste a lot of memories if the buffers
are used in a network driver for the TX/RX. So introduce
page_frag_alloc_align() to make sure that an aligned buffer address is
returned.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 377bf660 26-Jan-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: fix initialization of struct page for holes in memory layout"

This reverts commit d3921cb8be29ce5668c64e23ffdaeec5f8c69399.

Chris Wilson reports that it causes boot problems:

"We have half a dozen or so different machines in CI that are silently
failing to boot, that we believe is bisected to this patch"

and the CI team confirmed that a revert fixed the issues.

The cause is unknown for now, so let's revert it.

Link: https://lore.kernel.org/lkml/161160687463.28991.354987542182281928@build.alporthouse.com/
Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acb35b17 23-Jan-2021 Andrey Konovalov <andreyknvl@google.com>

kasan, mm: fix resetting page_alloc tags for HW_TAGS

A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.

That commit did reset page tags before the metadata access, but didn't
restore them after. As the result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.

Fix this by recovering the tag after the metadata access.

Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3921cb8 23-Jan-2021 Mike Rapoport <rppt@kernel.org>

mm: fix initialization of struct page for holes in memory layout

There could be struct pages that are not backed by actual physical
memory. This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized using init_unavailable_mem()
function that iterates through PFNs in holes in memblock.memory and if
there is a struct page corresponding to a PFN, the fields if this page
are set to default values and the page is marked as Reserved.

init_unavailable_mem() does not take into account zone and node the page
belongs to and sets both zone and node links in struct page to zero.

On a system that has firmware reserved holes in a zone above ZONE_DMA,
for instance in a configuration below:

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM

unset zone link in struct page will trigger

VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
in struct page) in the same pageblock.

Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly setup the zone link and use node ID of the
adjacent range in memblock.memory to set the node link.

Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce8f86ee 12-Jan-2021 Hailong liu <liu.hailong6@zte.com.cn>

mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint

The trace point *trace_mm_page_alloc_zone_locked()* in __rmqueue() does
not currently cover all branches. Add the missing tracepoint and check
the page before do that.

[akpm@linux-foundation.org: use IS_ENABLED() to suppress warning]

Link: https://lkml.kernel.org/r/20201228132901.41523-1-carver4lio@163.com
Signed-off-by: Hailong liu <liu.hailong6@zte.com.cn>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc2da7b4 29-Dec-2020 Baoquan He <bhe@redhat.com>

mm: memmap defer init doesn't work as expected

VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:

[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is the current memmap defer init doesn't work as expected.

Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node. However, since commit 73a6e474cb376,
function memmap_init() is adapted to iterater over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.

E.g, on VMware's system, the memory layout is as below, there are two
memory regions in node 2. The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
iniatialize only one memmory section on the 2nd region [mem
0x10000000000-0x1033fffffff]. In fact, we only expect to see that there's
only one memory section's memmap initialized. That's why more time is
costed at the time.

[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether defer need be taken in zone wide.

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa1ef4d7 22-Dec-2020 Andrey Konovalov <andreyknvl@google.com>

kasan, mm: reset tags when accessing metadata

Kernel allocator code accesses metadata for slab objects, that may lie
out-of-bounds of the object itself, or be accessed when an object is
freed. Such accesses trigger tag faults and lead to false-positive
reports with hardware tag-based KASAN.

Software KASAN modes disable instrumentation for allocator code via
KASAN_SANITIZE Makefile macro, and rely on kasan_enable/disable_current()
annotations which are used to ignore KASAN reports.

With hardware tag-based KASAN neither of those options are available, as
it doesn't use compiler instrumetation, no tag faults are ignored, and MTE
is disabled after the first one.

Instead, reset tags when accessing metadata (currently only for SLUB).

Link: https://lkml.kernel.org/r/a0f3cefbc49f34c843b664110842de4db28179d0.1606161801.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c746170d 22-Dec-2020 Vincenzo Frascino <vincenzo.frascino@arm.com>

kasan, mm: untag page address in free_reserved_area

free_reserved_area() memsets the pages belonging to a given memory area.
As that memory hasn't been allocated via page_alloc, the KASAN tags that
those pages have are 0x00. As the result the memset might result in a tag
mismatch.

Untag the address to avoid spurious faults.

Link: https://lkml.kernel.org/r/ebef6425f4468d063e2f09c1b62ccbb2236b71d3.1606161801.git.andreyknvl@google.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6168d0da 15-Dec-2020 Alex Shi <alex.shi@linux.alibaba.com>

mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
each of memcg per node. So on a large machine, each of memcg don't have
to suffer from per node pgdat->lru_lock competition. They could go fast
with their self lru_lock.

After move memcg charge before lru inserting, page isolation could
serialize page's memcg, then per memcg lruvec lock is stable and could
replace per node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort and
lock_page_lruvec_irqsave are open coded to work with compact_control.
Also add a debug func in locking which may give some clues if there are
sth out of hands.

Daniel Jordan's testing show 62% improvement on modified readtwice case on
his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Hugh Dickins helped on the patch polish, thanks!

[alex.shi@linux.alibaba.com: fix comment typo]
Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
[alex.shi@linux.alibaba.com: use page_memcg()]
Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a00cda3f 14-Dec-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

mm: fix kernel-doc markups

Kernel-doc markups should use this format:
identifier - description

Fix some issues on mm files:

1) The definition for get_user_pages_locked() doesn't follow it. Also,
it expects a short descrpition at the header, followed by a long one,
after the parameters. Fix it.

2) Kernel-doc requires that a kernel-doc markup to be immediately below
the function prototype, as otherwise it will rename it. So, move
get_pfnblock_flags_mask() description to the right place.

3) Make invalidate_mapping_pagevec() to also follow the expected
kernel-doc format.

While here, fix a few minor English syntax issues, as suggested
by Matthew:
will used -> will be used
similar with -> similar to

Link: https://lkml.kernel.org/r/80e85dddc92d333bc2159ee8a2294921612e8745.1605521731.git.mchehab+huawei@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Suggested-by: Mattew Wilcox <willy@infradead.org> [English fixes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f289041e 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO

CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA. It was
introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero
poisoning"), noting that using zeroes retains the benefit of sanitizing
content of freed pages, with the benefit of not having to zero them again
on alloc, and the downside of making some forms of corruption (stray
writes of NULLs) harder to detect than with the 0xAA pattern. Together
with CONFIG_PAGE_POISONING_NO_SANITY it made possible to sanitize the
contents on free without checking it back on alloc.

These days we have the init_on_free() option to achieve sanitization with
zeroes and to save clearing on alloc (and without checking on alloc).
Arguably if someone does choose to check the poison for corruption on
alloc, the savings of not clearing the page are secondary, and it makes
sense to always use the 0xAA poison pattern. Thus, remove the
CONFIG_PAGE_POISONING_ZERO option for being redundant.

Link: https://lkml.kernel.org/r/20201113104033.22907-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8db26a3d 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_poison: use static key more efficiently

Commit 11c9c7edae06 ("mm/page_poison.c: replace bool variable with static
key") changed page_poisoning_enabled() to a static key check. However,
the function is not inlined, so each check still involves a function call
with overhead not eliminated when page poisoning is disabled.

Analogically to how debug_pagealloc is handled, this patch converts
page_poisoning_enabled() back to boolean check, and introduces
page_poisoning_enabled_static() for fast paths. Both functions are
inlined.

The function kernel_poison_pages() is also called unconditionally and does
the static key check inside. Remove it from there and put it to callers.
Also split it to two functions kernel_poison_pages() and
kernel_unpoison_pages() instead of the confusing bool parameter.

Also optimize the check that enables page poisoning instead of
debug_pagealloc for architectures without proper debug_pagealloc support.
Move the check to init_mem_debugging_and_hardening() to enable a single
static key instead of having two static branches in
page_poisoning_enabled_static().

Link: https://lkml.kernel.org/r/20201113104033.22907-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04013513 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters

Patch series "cleanup page poisoning", v3.

I have identified a number of issues and opportunities for cleanup with
CONFIG_PAGE_POISON and friends:

- interaction with init_on_alloc and init_on_free parameters depends on
the order of parameters (Patch 1)

- the boot time enabling uses static key, but inefficienty (Patch 2)

- sanity checking is incompatible with hibernation (Patch 3)

- CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
init_on_free (Patch 4)

- CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
have init_on_free (Patch 5)

This patch (of 5):

Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
produces a warning in dmesg that page_poison takes precedence. However,
as these warnings are printed in early_param handlers for
init_on_alloc/free, they are not printed if page_poison is enabled later
on the command line (handlers are called in the order of their
parameters), or when init_on_alloc/free is always enabled by the
respective config option - before the page_poison early param handler is
called, it is not considered to be enabled. This is inconsistent.

We can remove the dependency on order by making the init_on_* parameters
only set a boolean variable, and postponing the evaluation after all early
params have been processed. Introduce a new
init_mem_debugging_and_hardening() function for that, and move the related
debug_pagealloc processing there as well.

As a result init_mem_debugging_and_hardening() knows always accurately if
init_on_* and/or page_poison options were enabled. Thus we can also
optimize want_init_on_alloc() and want_init_on_free(). We don't need to
check page_poisoning_enabled() there, we can instead not enable the
init_on_* static keys at all, if page poisoning is enabled. This results
in a simpler and more effective code.

Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Laura Abbott <labbott@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 597c8920 14-Dec-2020 Johannes Weiner <hannes@cmpxchg.org>

mm: don't wake kswapd prematurely when watermark boosting is disabled

On 2-node NUMA hosts we see bursts of kswapd reclaim and subsequent
pressure spikes and stalls from cache refaults while there is plenty of
free memory in the system.

Usually, kswapd is woken up when all eligible nodes in an allocation are
full. But the code related to watermark boosting can wake kswapd on one
full node while the other one is mostly empty. This may be justified to
fight fragmentation, but is currently unconditionally done whether
watermark boosting is occurring or not.

In our case, many of our workloads' throughput scales with available
memory, and pure utilization is a more tangible concern than trends
around longer-term fragmentation. As a result we generally disable
watermark boosting.

Wake kswapd only woken when watermark boosting is requested.

Link: https://lkml.kernel.org/r/20201020175833.397286-1-hannes@cmpxchg.org
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ad69832 14-Dec-2020 Muchun Song <songmuchun@bytedance.com>

mm/page_alloc: speed up the iteration of max_order

When we free a page whose order is very close to MAX_ORDER and greater
than pageblock_order, it wastes some CPU cycles to increase max_order to
MAX_ORDER one by one and check the pageblock migratetype of that page
repeatedly especially when MAX_ORDER is much larger than pageblock_order.

We also should not be checking migratetype of buddy when "order ==
MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition.
With the new check, we don't need the max_order check anymore, so we
replace it.

Also adjust max_order initialization so that it's lower by one than
previously, which makes the code hopefully more clear.

Link: https://lkml.kernel.org/r/20201204155109.55451-1-songmuchun@bytedance.com
Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other pageblocks")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 470c61d7 14-Dec-2020 Lorenzo Stoakes <lstoakes@gmail.com>

mm: page_alloc: refactor setup_per_zone_lowmem_reserve()

setup_per_zone_lowmem_reserve() iterates through each zone setting
zone->lowmem_reserve[j] = 0 (where j is the zone's index) then iterates
backwards through all preceding zones, setting
lower_zone->lowmem_reserve[j] = sum(managed pages of higher zones) /
lowmem_reserve_ratio[idx] for each (where idx is the lower zone's index).

If the lower zone has no managed pages or its ratio is 0 then all of its
lowmem_reserve[] entries are effectively zeroed.

As these arrays are only assigned here and all lowmem_reserve[] entries
for index < this zone's index are implicitly assumed to be 0 (as these are
specifically output in show_free_areas() and zoneinfo_show_print() for
example) there is no need to additionally zero index == this zone's index
too. This patch avoids zeroing unnecessarily.

Rather than iterating through zones and setting lowmem_reserve[j] for each
lower zone this patch reverse the process and populates each zone's
lowmem_reserve[] values in ascending order.

This clarifies what is going on especially in the case of zero managed
pages or ratio which is now explicitly shown to clear these values.

Link: https://lkml.kernel.org/r/20201129162758.115907-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba8f3587 14-Dec-2020 Lin Feng <linf@wangsu.com>

init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT set

In the booting phase if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set,
we have following callchain:

start_kernel
...
mm_init
mem_init
memblock_free_all
reset_all_zones_managed_pages
free_low_memory_core_early
...
buffer_init
nr_free_buffer_pages
zone->managed_pages
...
rest_init
kernel_init
kernel_init_freeable
page_alloc_init_late
kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
wait_for_completion(&pgdat_init_all_done_comp);
...
files_maxfiles_init

It's clear that buffer_init depends on zone->managed_pages, but it's reset
in reset_all_zones_managed_pages after that pages are readded into
zone->managed_pages, but when buffer_init runs this process is half done
and most of them will finally be added till deferred_init_memmap done. In
large memory couting of nr_free_buffer_pages drifts too much, also
drifting from kernels to kernels on same hardware.

Fix is simple, it delays buffer_init run till deferred_init_memmap all
done.

But as corrected by this patch, max_buffer_heads becomes very large, the
value is roughly as many as 4 times of totalram_pages, formula:
max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct
buffer_head));

Say in a 64GB memory box we have 16777216 pages, then max_buffer_heads
turns out to be roughly 67,108,864. In common cases, should a buffer_head
be mapped to one page/block(4KB)? So max_buffer_heads never exceeds
totalram_pages. IMO it's likely to make buffer_heads_over_limit bool
value alwasy false, then make codes 'if (buffer_heads_over_limit)' test in
vmscan unnecessary.

So this patch will change the original behavior related to
buffer_heads_over_limit in vmscan since we used a half done value of
zone->managed_pages before, or should we use a smaller factor(<10%) in
previous formula.

akpm: I think this is OK - the max_buffer_heads code is only needed on
highmem machines, to prevent ZONE_NORMAL from being consumed by large
amounts of buffer_heads attached to highmem pagecache. This problem will
not occur on 64-bit machines, so this feature's non-functionality on such
machines is a feature, not a bug.

Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 862b6dee 14-Dec-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: clear all pages in post_alloc_hook() with init_on_alloc=1

commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and
init_on_free=1 boot options") resulted with init_on_alloc=1 in all pages
leaving the buddy via alloc_pages() and friends to be
initialized/cleared/zeroed on allocation.

However, the same logic is currently not applied to alloc_contig_pages():
allocated pages leaving the buddy aren't cleared with init_on_alloc=1 and
init_on_free=0. Let's also properly clear pages on that allocation path.

To achieve that, let's move clearing into post_alloc_hook(). This will
not only affect alloc_contig_pages() allocations but also any pages used
as migration target in compaction code via compaction_alloc().

While this sounds sub-optimal, it's the very same handling as when
allocating migration targets via alloc_migration_target() - pages will get
properly cleared with init_on_free=1. In case we ever want to optimize
migration in that regard, we should tackle all such migration users - if
we believe migration code can be fully trusted.

With this change, we will see double clearing of pages in some cases. One
example are gigantic pages (either allocated via CMA, or allocated
dynamically via alloc_contig_pages()) - which is the right thing to do
(and to be optimized outside of the buddy in the callers) as discussed in:
https://lkml.kernel.org/r/20201019182853.7467-1-gpiccoli@canonical.com

This change implies that with init_on_alloc=1

- All CMA allocations will be cleared

- Gigantic pages allocated via alloc_contig_pages() will be cleared

- virtio-mem memory to be unplugged will be cleared. While this is
suboptimal, it's similar to memory balloon drivers handling, where
all pages to be inflated will get cleared as well.

- Pages isolated for compaction will be cleared

Link: https://lkml.kernel.org/r/20201120180452.19071-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b1f3658 14-Dec-2020 Zou Wei <zou_wei@huawei.com>

mm/page_alloc: mark some symbols with static keyword

Fix the following sparse warnings:

mm/page_alloc.c:3040:6: warning: symbol '__drain_all_pages' was not declared. Should it be static?
mm/page_alloc.c:6349:6: warning: symbol '__zone_set_pageset_high_and_batch' was not declared. Should it be static?

Link: https://lkml.kernel.org/r/1605517365-65858-1-git-send-email-zou_wei@huawei.com
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f194fbb 14-Dec-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc: add __free_pages() documentation

Provide some guidance towards when this might not be the right interface
to use.

Link: https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec6e8c7e 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: disable pcplists during memory offline

Memory offlining relies on page isolation to guarantee a forward progress
because pages cannot be reused while they are isolated. But the page
isolation itself doesn't prevent from races while freed pages are stored
on pcp lists and thus can be reused. This can be worked around by
repeated draining of pcplists, as done by commit 968318261221
("mm/memory_hotplug: drain per-cpu pages again during memory offline").

David and Michal would prefer that this race was closed in a way that
callers of page isolation who need stronger guarantees don't need to
repeatedly drain. David suggested disabling pcplists usage completely
during page isolation, instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we
can use the same approach as boot pagesets - when pcp->high is 0, any
pcplist addition will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining
pcplists once, before calling start_isolate_page_range(). The draining
will serialize after processes that already disabled interrupts and read
the old value of pcp->high in free_unref_page_commit(), and processes that
have not yet disabled interrupts, will observe pcp->high == 0 when they
are rescheduled, and skip pcplists. This guarantees no stray pages on
pcplists in zones where isolation happens.

This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions
that page isolation users can call before start_isolate_page_range() and
after unisolating (or offlining) the isolated pages.

Also, drain_all_pages() is optimized to only execute on cpus where
pcplists are not empty. The check can however race with a free to pcplist
that has not yet increased the pcp->count from 0 to 1. Thus make the
drain optionally skip the racy check and drain on all cpus, and use this
option in zone_pcp_disable().

As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it
in zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().

Currently the only user of this functionality is offline_pages().

[vbabka@suse.cz: add comment, per David]
Link: https://lkml.kernel.org/r/527480ef-ed72-e1c1-52a0-1c5b0113df45@suse.cz

Link: https://lkml.kernel.org/r/20201111092812.11329-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7612921f 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: move draining pcplists to page isolation users

Currently, pcplists are drained during set_migratetype_isolate() which
means once per pageblock processed start_isolate_page_range(). This is
somewhat wasteful. Moreover, the callers might need different guarantees,
and the draining is currently prone to races and does not guarantee that
no page from isolated pageblock will end up on the pcplist after the
drain.

Better guarantees are added by later patches and require explicit actions
by page isolation users that need them. Thus it makes sense to move the
current imperfect draining to the callers also as a preparation step.

Link: https://lkml.kernel.org/r/20201111092812.11329-7-vbabka@suse.cz
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 952eaf81 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: cache pageset high and batch in struct zone

All per-cpu pagesets for a zone use the same high and batch values, that
are duplicated there just for performance (locality) reasons. This patch
adds the same variables also to struct zone as a shared copy.

This will be useful later for making possible to disable pcplists
temporarily by setting high value to 0, while remembering the values for
restoring them later. But we can also immediately benefit from not
updating pagesets of all possible cpus in case the newly recalculated
values (after sysctl change or memory online/offline) are actually
unchanged from the previous ones.

Link: https://lkml.kernel.org/r/20201111092812.11329-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c3ad2eb 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: simplify pageset_update()

pageset_update() attempts to update pcplist's high and batch values in a
way that readers don't observe batch > high. It uses smp_wmb() to order
the updates in a way to achieve this. However, without proper pairing
read barriers in readers this guarantee doesn't hold, and there are no
such barriers in e.g. free_unref_page_commit().

Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in
free_pcppages_bulk()") already showed this is problematic, and solved this
by ultimately only trusing pcp->count of the current cpu with interrupts
disabled.

The update dance with unpaired write barriers thus makes no sense.
Replace them with plain WRITE_ONCE to prevent store tearing, and document
that the values can change asynchronously and should not be trusted for
correctness.

All current readers appear to be OK after 88e8ac11d2ea. Convert them to
READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
making future changes to the code that special care is needed.

Link: https://lkml.kernel.org/r/20201111092812.11329-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69a8396a 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: remove setup_pageset()

We initialize boot-time pagesets with setup_pageset(), which sets high and
batch values that effectively disable pcplists.

We can remove this wrapper if we just set these values for all pagesets in
pageset_init(). Non-boot pagesets then subsequently update them to the
proper values.

No functional change.

Link: https://lkml.kernel.org/r/20201111092812.11329-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a8b4f1d 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: calculate pageset high and batch once per zone

We currently call pageset_set_high_and_batch() for each possible cpu,
which repeats the same calculations of high and batch values.

Instead call the function just once per zone, and make it apply the
calculated values to all per-cpu pagesets of the zone.

This also allows removing the zone_pageset_init() and __zone_pcp_update()
wrappers.

No functional change.

Link: https://lkml.kernel.org/r/20201111092812.11329-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7115ac6e 14-Dec-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: clean up pageset high and batch update

Patch series "disable pcplists during memory offline", v3.

As per the discussions [1] [2] this is an attempt to implement David's
suggestion that page isolation should disable pcplists to avoid races with
page freeing in progress. This is done without extra checks in fast
paths, as explained in Patch 9. The repeated draining done by [2] is then
no longer needed. Previous version (RFC) is at [3].

The RFC tried to hide pcplists disabling/enabling into page isolation, but
it wasn't completely possible, as memory offline does not unisolation.
Michal suggested an explicit API in [4] so that's the current
implementation and it seems indeed nicer.

Once we accept that page isolation users need to do explicit actions
around it depending on the needed guarantees, we can also IMHO accept that
the current pcplist draining can be also done by the callers, which is
more effective. After all, there are only two users of page isolation.
So patch 6 does effectively the same thing as Pavel proposed in [5], and
patch 7 implement stronger guarantees only for memory offline. If CMA
decides to opt-in to the stronger guarantee, it can be added later.

Patches 1-5 are preparatory cleanups for pcplist disabling.

Patchset was briefly tested in QEMU so that memory online/offline works,
but I haven't done a stress test that would prove the race fixed by [2] is
eliminated.

Note that patch 7 could be avoided if we instead adjusted page freeing in
shown in [6], but I believe the current implementation of disabling
pcplists is not too much complex, so I would prefer this instead of adding
new checks and longer irq-disabled section into page freeing hotpaths.

[1] https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com/
[2] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
[3] https://lore.kernel.org/linux-mm/20200907163628.26495-1-vbabka@suse.cz/
[4] https://lore.kernel.org/linux-mm/20200909113647.GG7348@dhcp22.suse.cz/
[5] https://lore.kernel.org/linux-mm/20200904151448.100489-3-pasha.tatashin@soleen.com/
[6] https://lore.kernel.org/linux-mm/3d3b53db-aeaa-ff24-260b-36427fac9b1c@suse.cz/
[7] https://lore.kernel.org/linux-mm/20200922143712.12048-1-vbabka@suse.cz/
[8] https://lore.kernel.org/linux-mm/20201008114201.18824-1-vbabka@suse.cz/

This patch (of 7):

The updates to pcplists' high and batch values are handled by multiple
functions that make the calculations hard to follow. Consolidate
everything to pageset_set_high_and_batch() and remove pageset_set_batch()
and pageset_set_high() wrappers.

The only special case using one of the removed wrappers was:
build_all_zonelists_init()

setup_pageset()
pageset_set_batch()

which was hardcoding batch as 0, so we can just open-code a call to
pageset_update() with constant parameters instead.

No functional change.

Link: https://lkml.kernel.org/r/20201111092812.11329-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20201111092812.11329-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77bc7fd6 14-Dec-2020 Mike Rapoport <rppt@kernel.org>

mm: introduce debug_pagealloc_{map,unmap}_pages() helpers

Patch series "arch, mm: improve robustness of direct map manipulation", v7.

During recent discussion about KVM protected memory, David raised a
concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's lack of consistency of __kernel_map_pages() semantics
across architectures as some guard this function with #ifdef
DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
debugging is disabled at run time and some allow modifying the direct map
regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly.

Since currently the only user of __kernel_map_pages() outside
DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
there more explicit.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com

This patch (of 4):

When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages than need to be mapped back
before they could be used. Theese mapping operations use
__kernel_map_pages() guarded with with debug_pagealloc_enabled().

The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set. Still,
on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
enabled but set_direct_map_invalid_noflush() may render some pages not
present in the direct map and hibernation code won't be able to save such
pages.

To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
made more explicit.

Start with combining the guard condition and the call to
__kernel_map_pages() into debug_pagealloc_map_pages() and
debug_pagealloc_unmap_pages() functions to emphasize that
__kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
these new functions to map/unmap pages when page allocation debugging is
enabled.

Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03e92a5e 14-Dec-2020 Mike Rapoport <rppt@kernel.org>

ia64: remove custom __early_pfn_to_nid()

The ia64 implementation of __early_pfn_to_nid() essentially relies on the
same data as the generic implementation.

The correspondence between memory ranges and nodes is set in memblock
during early memory initialization in register_active_ranges() function.

The initialization of sparsemem that requires early_pfn_to_nid() happens
later and it can use the memblock information like the other architectures.

Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Michael Schmitz <schmitzmic@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f920e413 14-Dec-2020 Daniel Vetter <daniel.vetter@ffwll.ch>

mm: track mmu notifiers in fs_reclaim_acquire/release

fs_reclaim_acquire/release nicely catch recursion issues when allocating
GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
the excessive caches in check). For mmu notifier recursions we do have
lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep
map for invalidate_range_start/end").

But these only fire if a path actually results in some pte invalidation -
for most small allocations that's very rarely the case. The other trouble
is that pte invalidation can happen any time when __GFP_RECLAIM is set.
Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good
enough to avoid potential mmu notifier recursion.

I was pondering whether we should just do the general annotation, but
there's always the risk for false positives. Plus I'm assuming that the
core fs and io code is a lot better reviewed and tested than random mmu
notifier code in drivers. Hence why I decide to only annotate for that
specific case.

Furthermore even if we'd create a lockdep map for direct reclaim, we'd
still need to explicit pull in the mmu notifier map - there's a lot more
places that do pte invalidation than just direct reclaim, these two
contexts arent the same.

Note that the mmu notifiers needing their own independent lockdep map is
also the reason we can't hold them from fs_reclaim_acquire to
fs_reclaim_release - it would nest with the acquistion in the pte
invalidation code, causing a lockdep splat. And we can't remove the
annotations from pte invalidation and all the other places since they're
called from many other places than page reclaim. Hence we can only do the
equivalent of might_lock, but on the raw lockdep map.

With this we can also remove the lockdep priming added in 66204f1d2d1b
("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
more powerful.

Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0c0c115 14-Dec-2020 Shakeel Butt <shakeelb@google.com>

mm: memcontrol: account pagetables per node

For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.

[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]

Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 18b2db3b 01-Dec-2020 Roman Gushchin <guro@fb.com>

mm: Convert page kmemcg type to a page memcg flag

PageKmemcg flag is currently defined as a page type (like buddy, offline,
table and guard). Semantically it means that the page was accounted as a
kernel memory by the page allocator and has to be uncharged on the
release.

As a side effect of defining the flag as a page type, the accounted page
can't be mapped to userspace (look at page_has_type() and comments above).
In particular, this blocks the accounting of vmalloc-backed memory used
by some bpf maps, because these maps do map the memory to userspace.

One option is to fix it by complicating the access to page->mapcount,
which provides some free bits for page->page_type.

But it's way better to move this flag into page->memcg_data flags.
Indeed, the flag makes no sense without enabled memory cgroups and memory
cgroup pointer set in particular.

This commit replaces PageKmemcg() and __SetPageKmemcg() with
PageMemcgKmem() and an open-coded OR operation setting the memcg pointer
with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted,
as the whole memcg_data is zeroed at once.

As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be
compiled out.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com


# bcfe06bf 01-Dec-2020 Roman Gushchin <guro@fb.com>

mm: memcontrol: Use helpers to read page's memcg data

Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter. Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information. In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com


# d8c19014 15-Nov-2020 Dongli Zhang <dongli.zhang@oracle.com>

page_frag: Recover from memory pressure

The ethernet driver may allocate skb (and skb->data) via napi_alloc_skb().
This ends up to page_frag_alloc() to allocate skb->data from
page_frag_cache->va.

During the memory pressure, page_frag_cache->va may be allocated as
pfmemalloc page. As a result, the skb->pfmemalloc is always true as
skb->data is from page_frag_cache->va. The skb will be dropped if the
sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
under memory pressure.

However, once kernel is not under memory pressure any longer (suppose large
amount of memory pages are just reclaimed), the page_frag_alloc() may still
re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
result, the skb->pfmemalloc is always true unless page_frag_cache->va is
re-allocated, even if the kernel is not under memory pressure any longer.

Here is how kernel runs into issue.

1. The kernel is under memory pressure and allocation of
PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
the pfmemalloc page is allocated for page_frag_cache->va.

2: All skb->data from page_frag_cache->va (pfmemalloc) will have
skb->pfmemalloc=true. The skb will always be dropped by sock without
SOCK_MEMALLOC. This is an expected behaviour.

3. Suppose a large amount of pages are reclaimed and kernel is not under
memory pressure any longer. We expect skb->pfmemalloc drop will not happen.

4. Unfortunately, page_frag_alloc() does not proactively re-allocate
page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
skb->pfmemalloc is always true even kernel is not under memory pressure any
longer.

Fix this by freeing and re-allocating the page instead of recycling it.

References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Bert Barbe <bert.barbe@oracle.com>
Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ab130f91 15-Oct-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: rename page_order() to buddy_order()

The current page_order() can only be called on pages in the buddy
allocator. For compound pages, you have to use compound_order(). This is
confusing and led to a bug, so rename page_order() to buddy_order().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7fef431b 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: place pages to tail in __free_pages_core()

__free_pages_core() is used when exposing fresh memory to the buddy during
system boot and when onlining memory in generic_online_page().

generic_online_page() is used in two cases:

1. Direct memory onlining in online_pages().
2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
balloon and virtio-mem), when parts of a section are kept
fake-offline to be fake-onlined later on.

In 1, we already place pages to the tail of the freelist. Pages will be
freed to MIGRATE_ISOLATE lists first and moved to the tail of the
freelists via undo_isolate_page_range().

In 2, we currently don't implement a proper rule. In case of virtio-mem,
where we currently always online MAX_ORDER - 1 pages, the pages will be
placed to the HEAD of the freelist - undesireable. While the hyper-v
balloon calls generic_online_page() with single pages, usually it will
call it on successive single pages in a larger block.

The pages are fresh, so place them to the tail of the freelist and avoid
the PCP. In __free_pages_core(), remove the now superflouos call to
set_page_refcounted() and add a comment regarding page initialization and
the refcount.

Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
is usually of limited use in virtualized environments), we might want to
shuffle after a sequence of generic_online_page() calls in the relevant
callers.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 293ffa5e 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: move pages to tail in move_to_free_list()

Whenever we move pages between freelists via move_to_free_list()/
move_freepages_block(), we don't actually touch the pages:
1. Page isolation doesn't actually touch the pages, it simply isolates
pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
When undoing isolation, we move the pages back to the target list.
2. Page stealing (steal_suitable_fallback()) moves free pages directly
between lists without touching them.
3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
free pages directly between freelists without touching them.

We already place pages to the tail of the freelists when undoing isolation
via __putback_isolated_page(), let's do it in any case (e.g., if order <=
pageblock_order) and document the behavior. To simplify, let's move the
pages to the tail for all move_to_free_list()/move_freepages_block() users.

In 2., the target list is empty, so there should be no change. In 3., we
might observe a change, however, highatomic is more concerned about
allocations succeeding than cache hotness - if we ever realize this change
degrades a workload, we can special-case this instance and add a proper
comment.

This change results in all pages getting onlined via online_pages() to be
placed to the tail of the freelist.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47b6a24a2 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: place pages to tail in __putback_isolated_page()

__putback_isolated_page() already documents that pages will be placed to
the tail of the freelist - this is, however, not the case for "order >=
MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
all existing users.

This change affects two users:
- free page reporting
- page isolation, when undoing the isolation (including memory onlining).

This behavior is desirable for pages that haven't really been touched
lately, so exactly the two users that don't actually read/write page
content, but rather move untouched pages.

The new behavior is especially desirable for memory onlining, where we
allow allocation of newly onlined pages via undo_isolate_page_range() in
online_pages(). Right now, we always place them to the head of the
freelist, resulting in undesireable behavior: Assume we add individual
memory chunks via add_memory() and online them right away to the NORMAL
zone. We create a dependency chain of unmovable allocations e.g., via the
memmap. The memmap of the next chunk will be placed onto previous chunks
- if the last block cannot get offlined+removed, all dependent ones cannot
get offlined+removed. While this can already be observed with individual
DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
DLPAR).

Document that this should only be used for optimizations, and no code
should rely on this behavior for correction (if the order of the freelists
ever changes).

We won't care about page shuffling: memory onlining already properly
shuffles after onlining. free page reporting doesn't care about
physically contiguous ranges, and there are already cases where page
isolation will simply move (physically close) free pages to (currently)
the head of the freelists via move_freepages_block() instead of shuffling.
If this becomes ever relevant, we should shuffle the whole zone when
undoing isolation of larger ranges, and after free_contig_range().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f04a5d5d 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.

When adding separate memory blocks via add_memory*() and onlining them
immediately, the metadata (especially the memmap) of the next block will
be placed onto one of the just added+onlined block. This creates a chain
of unmovable allocations: If the last memory block cannot get
offlined+removed() so will all dependent ones. We directly have unmovable
allocations all over the place.

This can be observed quite easily using virtio-mem, however, it can also
be observed when using DIMMs. The freshly onlined pages will usually be
placed to the head of the freelists, meaning they will be allocated next,
turning the just-added memory usually immediately un-removable. The fresh
pages are cold, prefering to allocate others (that might be hot) also
feels to be the natural thing to do.

It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
adding separate, successive memory blocks, each memory block will have
unmovable allocations on them - for example gigantic pages will fail to
allocate.

While the ZONE_NORMAL doesn't provide any guarantees that memory can get
offlined+removed again (any kind of fragmentation with unmovable
allocations is possible), there are many scenarios (hotplugging a lot of
memory, running workload, hotunplug some memory/as much as possible) where
we can offline+remove quite a lot with this patchset.

a) To visualize the problem, a very simple example:

Start a VM with 4GB and 8GB of virtio-mem memory:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000033fffffff 9G online yes 32-103

Memory block size: 128M
Total online memory: 12G
Total offline memory: 0B

Then try to unplug as much as possible using virtio-mem. Observe which
memory blocks are still around. Without this patch set:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000013fffffff 1G online yes 32-39
0x0000000148000000-0x000000014fffffff 128M online yes 41
0x0000000158000000-0x000000015fffffff 128M online yes 43
0x0000000168000000-0x000000016fffffff 128M online yes 45
0x0000000178000000-0x000000017fffffff 128M online yes 47
0x0000000188000000-0x0000000197ffffff 256M online yes 49-50
0x00000001a0000000-0x00000001a7ffffff 128M online yes 52
0x00000001b0000000-0x00000001b7ffffff 128M online yes 54
0x00000001c0000000-0x00000001c7ffffff 128M online yes 56
0x00000001d0000000-0x00000001d7ffffff 128M online yes 58
0x00000001e0000000-0x00000001e7ffffff 128M online yes 60
0x00000001f0000000-0x00000001f7ffffff 128M online yes 62
0x0000000200000000-0x0000000207ffffff 128M online yes 64
0x0000000210000000-0x0000000217ffffff 128M online yes 66
0x0000000220000000-0x0000000227ffffff 128M online yes 68
0x0000000230000000-0x0000000237ffffff 128M online yes 70
0x0000000240000000-0x0000000247ffffff 128M online yes 72
0x0000000250000000-0x0000000257ffffff 128M online yes 74
0x0000000260000000-0x0000000267ffffff 128M online yes 76
0x0000000270000000-0x0000000277ffffff 128M online yes 78
0x0000000280000000-0x0000000287ffffff 128M online yes 80
0x0000000290000000-0x0000000297ffffff 128M online yes 82
0x00000002a0000000-0x00000002a7ffffff 128M online yes 84
0x00000002b0000000-0x00000002b7ffffff 128M online yes 86
0x00000002c0000000-0x00000002c7ffffff 128M online yes 88
0x00000002d0000000-0x00000002d7ffffff 128M online yes 90
0x00000002e0000000-0x00000002e7ffffff 128M online yes 92
0x00000002f0000000-0x00000002f7ffffff 128M online yes 94
0x0000000300000000-0x0000000307ffffff 128M online yes 96
0x0000000310000000-0x0000000317ffffff 128M online yes 98
0x0000000320000000-0x0000000327ffffff 128M online yes 100
0x0000000330000000-0x000000033fffffff 256M online yes 102-103

Memory block size: 128M
Total online memory: 8.1G
Total offline memory: 0B

With this patch set:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000013fffffff 1G online yes 32-39

Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B

All memory can get unplugged, all memory block can get removed. Of
course, no workload ran and the system was basically idle, but it
highlights the issue - the fairly deterministic chain of unmovable
allocations. When a huge page for the 2MB memmap is needed, a
just-onlined 4MB page will be split. The remaining 2MB page will be used
for the memmap of the next memory block. So one memory block will hold
the memmap of the two following memory blocks. Finally the pages of the
last-onlined memory block will get used for the next bigger allocations -
if any allocation is unmovable, all dependent memory blocks cannot get
unplugged and removed until that allocation is gone.

Note that with bigger memory blocks (e.g., 256MB), *all* memory
blocks are dependent and none can get unplugged again!

b) Experiment with memory intensive workload

I performed an experiment with an older version of this patch set (before
we used undo_isolate_page_range() in online_pages(): Hotplug 56GB to a VM
with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
kernel when adding it. I then run various memory intensive workloads that
consume most system memory for a total of 45 minutes. Once finished, I
try to unplug as much memory as possible.

With this change, I am able to remove via virtio-mem (adding individual
128MB memory blocks) 413 out of 448 added memory blocks. Via individual
(256MB) DIMMs 380 out of 448 added memory blocks. (I don't have any
numbers without this patchset, but looking at the above example, it's at
most half of the 448 memory blocks for virtio-mem, and most probably none
for DIMMs).

Again, there are workloads that might behave very differently due to the
nature of ZONE_NORMAL.

This change also affects (besides memory onlining):
- Other users of undo_isolate_page_range(): Pages are always placed to the
tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the
tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIKs, any memory that is getting exposed to the buddy during boot.
IIUC we will now usually allocate memory from lower addresses within
a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon

This patch (of 5):

Let's prepare for additional flags and avoid long parameter lists of
bools. Follow-up patches will also make use of the flags in
__free_pages_ok().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d882c006 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()

On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4eb29bd9 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()

Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
set during struct page init") moved the actual zone range check, leaving
only the alignment check for pageblocks.

Let's drop the stale comment and make the pageblock check easier to read.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3fa0c7c7 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_isolation: simplify return value of start_isolate_page_range()

Callers no longer need the number of isolated pageblocks. Let's simplify.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 257bea71 15-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: simplify __offline_isolated_pages()

offline_pages() is the only user. __offline_isolated_pages() never gets
called with ranges that contain memory holes and we no longer care about
the return value. Drop the return value handling and all pfn_valid()
checks.

Update the documentation.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79f5f8fa 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: rework soft offline for in-use pages

This patch changes the way we set and handle in-use poisoned pages. Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safe net and
would skip that page.

This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be in a buddy freelist.

Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)

* Normal pages (order-0 and anon-THP)

- If they are clean and unmapped page cache pages, we invalidate
then by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, do not call put_page directly from those paths. Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison page in
free_pages_prepare, that just skips any poisoned page, so those pages
do not end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

- wait until the page hits buddy's freelists
- take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has been already allocated so we cannot soft offline it.
But the user could always retry it.

* Hugetlb pages

- We isolate-and-migrate them

After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided to not make the code overly complicated and just fail
normally if the page we allocated in the meantime.

We can always build on top of this.

As a bonus, because of the way we handle now in-use pages, we no longer
need the put-as-isolation-migratetype dance, that was guarding for poisoned
pages to end up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06be6ff3 15-Oct-2020 Oscar Salvador <osalvador@suse.de>

mm,hwpoison: rework soft offline for free pages

When trying to soft-offline a free page, we need to first take it off the
buddy allocator. Once we know is out of reach, we can safely flag it as
poisoned.

take_page_off_buddy will be used to take a page meant to be poisoned off
the buddy allocator. take_page_off_buddy calls break_down_buddy_pages,
which splits a higher-order page in case our page belongs to one.

Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8fb156c9 15-Oct-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_owner: change split_page_owner to take a count

The implementation of split_page_owner() prefers a count rather than the
old order of the page. When we support a variable size THP, we won't
have the order at this point, but we will have the number of pages.
So change the interface to what the caller and callee would prefer.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc6de168 13-Oct-2020 Mike Rapoport <rppt@kernel.org>

memblock: use separate iterators for memory and reserved regions

for_each_memblock() is used to iterate over memblock.memory in a few
places that use data from memblock_region rather than the memory ranges.

Introduce separate for_each_mem_region() and
for_each_reserved_mem_region() to improve encapsulation of memblock
internals from its users.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> [MIPS]
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e245ad4 13-Oct-2020 Mike Rapoport <rppt@kernel.org>

memblock: reduce number of parameters in for_each_mem_range()

Currently for_each_mem_range() and for_each_mem_range_rev() iterators are
the most generic way to traverse memblock regions. As such, they have 8
parameters and they are hardly convenient to users. Most users choose to
utilize one of their wrappers and the only user that actually needs most
of the parameters is memblock itself.

To avoid yet another naming for memblock iterators, rename the existing
for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add a new
for_each_mem_range[_rev]() wrappers with only index, start and end
parameters.

The new wrapper nicely fits into init_unavailable_mem() and will be used
in upcoming changes to simplify memblock traversals.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> [MIPS]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-11-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e320d301 13-Oct-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/page_alloc.c: fix freeing non-compound pages

Here is a very rare race which leaks memory:

Page P0 is allocated to the page cache. Page P1 is free.

Thread A Thread B Thread C
find_get_entry():
xas_load() returns P0
Removes P0 from page cache
P0 finds its buddy P1
alloc_pages(GFP_KERNEL, 1) returns P0
P0 has refcount 1
page_cache_get_speculative(P0)
P0 has refcount 2
__free_pages(P0)
P0 has refcount 1
put_page(P0)
P1 is not freed

Fix this by freeing all the pages in __free_pages() that won't be freed
by the call to put_page(). It's usually not a good idea to split a page,
but this is a very unlikely scenario.

Fixes: e286781d5f2e ("mm: speculative page references")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30d8ec73 13-Oct-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mmzone: clean code by removing unused macro parameter

Previously 'for_next_zone_zonelist_nodemask' macro parameter 'zlist' was
unused so this patch removes it.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200917211906.30059-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2187e17b 13-Oct-2020 Yanfei Xu <yanfei.xu@windriver.com>

mm/page_alloc.c: __perform_reclaim should return 'unsigned long'

__perform_reclaim()'s single caller expects it to return 'unsigned long',
hence change its return value and a local variable to 'unsigned long'.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200916022138.16740-1-yanfei.xu@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0622d05 13-Oct-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm/page_alloc.c: clean code by merging two functions

finalise_ac() is just 'epilogue' for 'prepare_alloc_pages'. Therefore
there is no need to keep them both so 'finalise_ac' content can be merged
into prepare_alloc_pages() code. It would make __alloc_pages_nodemask()
cleaner when it comes to readability.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fdd4fa1c 13-Oct-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm/page_alloc.c: fix early params garbage value accesses

Previously in '__init early_init_on_alloc' and '__init early_init_on_free'
the return values from 'kstrtobool' were not handled properly. That
caused potential garbage value read from variable 'bool_result'.
Introduced patch fixes error handling.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200916214125.28271-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfb4a541 13-Oct-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm/page_alloc.c: micro-optimization remove unnecessary branch

Previously flags check was separated into two separated checks with two
separated branches. In case of presence of any of two mentioned flags,
the same effect on flow occurs. Therefore checks can be merged and one
branch can be avoided.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200911092310.31136-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b630749f 13-Oct-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm/page_alloc.c: clean code by removing unnecessary initialization

Previously variable 'tmp' was initialized, but was not read later before
reassigning. So the initialization can be removed.

[akpm@linux-foundation.org: remove `tmp' altogether]

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200904132422.17387-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a654e36 13-Oct-2020 Li Xinhai <lixinhai.lxh@gmail.com>

mm, isolation: avoid checking unmovable pages across pageblock boundary

In has_unmovable_pages(), the page parameter would not always be the first
page within a pageblock (see how the page pointer is passed in from
start_isolate_page_range() after call __first_valid_page()), so that would
cause checking unmovable pages span two pageblocks.

After this patch, the checking is enforced within one pageblock no matter
the page is first one or not, and obey the semantics of this function.

This issue is found by code inspection.

Michal said "this might lead to false negatives when an unrelated block
would cause an isolation failure".

Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Link: https://lkml.kernel.org/r/20200824065811.383266-1-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9c510dc 13-Oct-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: tweak comments in has_unmovable_pages()

Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5.

When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather
unclear, which is why we special-cased ZONE_MOVABLE such that partially
plugged blocks would never end up in ZONE_MOVABLE.

Now that the semantics are much clearer (and are documented in patch #6),
let's support partially plugged memory blocks in ZONE_MOVABLE, allowing
partially plugged memory blocks to be online to ZONE_MOVABLE and also
unplugging from such memory blocks. This avoids surprises when onlining
of memory blocks suddenly fails, just because they are not completely
populated by virtio-mem (yet).

This is especially helpful for testing, but also paves the way for
virtio-mem optimizations, allowing more memory to get reliably unplugged.

Cleanup has_unmovable_pages() and set_migratetype_isolate(), providing
better documentation of how ZONE_MOVABLE interacts with different kind of
unmovable pages (memory offlining vs. alloc_contig_range()).

This patch (of 6):

Let's move the split comment regarding bootmem allocations and memory
holes, especially in the context of ZONE_MOVABLE, to the PageReserved()
check.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200816125333.7434-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200816125333.7434-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4aab2be0 11-Oct-2020 Vijay Balakrishna <vijayb@linux.microsoft.com>

mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

When memory is hotplug added or removed the min_free_kbytes should be
recalculated based on what is expected by khugepaged. Currently after
hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost.

This change restores min_free_kbytes as expected for THP consumers.

[vijayb@linux.microsoft.com: v5]
Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com

Fixes: f000565adb77 ("thp: set recommended min free kbytes")
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Allen Pais <apais@microsoft.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d91df85 02-Oct-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

memalloc_nocma_{save/restore} APIs can be used to skip page allocation
on CMA area, but, there is a missing case and the page on CMA area could
be allocated even if APIs are used. This patch handles this case to fix
the potential issue.

For now, these APIs are used to prevent long-term pinning on the CMA
page. When the long-term pinning is requested on the CMA page, it is
migrated to the non-CMA page before pinning. This non-CMA page is
allocated by using memalloc_nocma_{save/restore} APIs. If APIs doesn't
work as intended, the CMA page is allocated and it is pinned for a long
time. This long-term pin for the CMA page causes cma_alloc() failure
and it could result in wrong behaviour on the device driver who uses the
cma_alloc().

Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
could have the pages on CMA area so we need to skip it if ALLOC_CMA
isn't specified.

Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/1601429472-12599-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1d0da83 25-Sep-2020 Laurent Dufour <ldufour@linux.ibm.com>

mm: replace memmap_context by meminit_context

Patch series "mm: fix memory to node bad links in sysfs", v3.

Sometimes, firmware may expose interleaved memory layout like this:

Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]

In that case, we can see memory blocks assigned to multiple nodes in
sysfs:

$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.

This is wrong but doesn't prevent the system to run. However when
later, one of these memory blocks is hot-unplugged and then hot-plugged,
the system is detecting an inconsistency in the sysfs layout and a
BUG_ON() is raised:

kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c

This has been seen on PowerPC LPAR.

The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.

There are two issues here:

(a) The sysfs memory and node's layouts are broken due to these
multiple links

(b) The link errors in link_mem_sections() should not lead to a system
panic.

To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not. This is addressed by the patches 1 and 2 of this
series.

Issue (b) will be addressed separately.

This patch (of 2):

The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.

Make it general to the hotplug operation and rename it as
meminit_context.

There is no functional change introduced by this patch

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76cd6173 27-Aug-2020 Alexei Starovoitov <ast@kernel.org>

mm/error_inject: Fix allow_error_inject function signatures.

'static' and 'static noinline' function attributes make no guarantees that
gcc/clang won't optimize them. The compiler may decide to inline 'static'
function and in such case ALLOW_ERROR_INJECT becomes meaningless. The compiler
could have inlined __add_to_page_cache_locked() in one callsite and didn't
inline in another. In such case injecting errors into it would cause
unpredictable behavior. It's worse with 'static noinline' which won't be
inlined, but it still can be optimized. Like the compiler may decide to remove
one argument or constant propagate the value depending on the callsite.

To avoid such issues make sure that these functions are global noinline.

Fixes: af3b854492f3 ("mm/page_alloc.c: allow error injection")
Fixes: cfcbfb1382db ("mm/filemap.c: enable error injection at add_to_page_cache()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com


# 88e8ac11 20-Aug-2020 Charan Teja Reddy <charante@codeaurora.org>

mm, page_alloc: fix core hung in free_pcppages_bulk()

The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.

P1 P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.

Allocate the pages from the
movable zone.

Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.

Since num of pages in the pcp
lists are less than ->batch,
then it will stuck in
while(list_empty(list)) loop
with interrupts disabled thus
a core hung.

Avoid this by ensuring free_pcppages_bulk() is called with proper count of
pcp list pages.

The mentioned race is some what easily reproducible without [1] because
pcp's are not updated for the first memory block online and thus there is
a enough race window for P2 between alloc+free and pcp struct values
update through onlining of second memory block.

With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.

This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).

[1]: https://patchwork.kernel.org/patch/11696389/

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: <stable@vger.kernel.org> [2.6+]
Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e08d3fdf 20-Aug-2020 Doug Berger <opendmb@gmail.com>

mm: include CMA pages in lowmem_reserve at boot

The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones. Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.

The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.

The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.

The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall. This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order. With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.

This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.

In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout

cma: Reserved 256 MiB at 0x0000000030000000
Zone ranges:
DMA [mem 0x0000000000000000-0x000000002fffffff]
Normal empty
HighMem [mem 0x0000000030000000-0x000000003fffffff]

would result in 0 lowmem_reserve for the DMA zone. This would allow
userspace to deplete the DMA zone easily.

Funnily enough

$ cat /proc/sys/vm/lowmem_reserve_ratio

would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.

This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.

Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1378a5ee 14-Aug-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: store compound_nr as well as compound_order

Patch series "THP prep patches".

These are some generic cleanups and improvements, which I would like
merged into mmotm soon. The first one should be a performance improvement
for all users of compound pages, and the others are aimed at getting code
to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (ie small
systems). Also better documented / less confusing than the current prefix
mixture of compound, hpage and thp.

This patch (of 7):

This removes a few instructions from functions which need to know how many
pages are in a compound page. The storage used is either page->mapping on
64-bit or page->index on 32-bit. Both of these are fine to overlay on
tail pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8b94e0b8 11-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: remove a wrapper for alloc_migration_target()

There is a well-defined standard migration target callback. Use it
directly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 047b9967 11-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

mm/page_alloc.c: delete or fix duplicated words

Drop the repeated word "them" and "that".
Change "the the" to "to the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8510e69c 07-Aug-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.

First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.

To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.

Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 182f3d7a 07-Aug-2020 Muchun Song <songmuchun@bytedance.com>

mm/page_alloc.c: skip setting nodemask when we are in interrupt

When we are in the interrupt context, it is irrelevant to the current task
context. If we use current task's mems_allowed, we can be fair to alloc
pages in the fast path and fall back to slow path memory allocation when
the current node(which is the current task mems_allowed) does not have
enough memory to allocate. In this case, it slows down the memory
allocation speed of interrupt context. So we can skip setting the
nodemask to allow any node to allocate memory, so that fast path
allocation can success.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da415663 07-Aug-2020 Wei Yang <richard.weiyang@linux.alibaba.com>

mm/page_alloc: fallbacks at most has 3 elements

MIGRAGE_TYPES is used to be the mark of end and there are at most 3
elements for the one dimension array.

Reduce to 3 to save little memory.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200625231022.18784-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9e15afa5 07-Aug-2020 Qian Cai <cai@lca.pw>

mm/page_alloc: silence a KASAN false positive

kernel_init_free_pages() will use memset() on s390 to clear all pages from
kmalloc_order() which will override KASAN redzones because a redzone was
setup from the end of the allocation size to the end of the last page.
Silence it by not reporting it there. An example of the report is,

BUG: KASAN: slab-out-of-bounds in __free_pages_ok
Write of size 4096 at addr 000000014beaa000
Call Trace:
show_stack+0x152/0x210
dump_stack+0x1f8/0x248
print_address_description.isra.13+0x5e/0x4d0
kasan_report+0x130/0x178
check_memory_region+0x190/0x218
memset+0x34/0x60
__free_pages_ok+0x894/0x12f0
kfree+0x4f2/0x5e0
unpack_to_rootfs+0x60e/0x650
populate_rootfs+0x56/0x358
do_one_initcall+0x1f4/0xa20
kernel_init_freeable+0x758/0x7e8
kernel_init+0x1c/0x170
ret_from_fork+0x24/0x28
Memory state around the buggy address:
000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
^
000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe

Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Link: http://lkml.kernel.org/r/20200610052154.5180-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 535b81e2 07-Aug-2020 Wei Yang <richard.weiyang@linux.alibaba.com>

mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()

After previous cleanup, the end_bitidx is not necessary any more.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d93d5ab9 07-Aug-2020 Wei Yang <richard.weiyang@linux.alibaba.com>

mm/page_alloc.c: simplify pageblock bitmap access

Due to commit e58469bafd05 ("mm: page_alloc: use word-based accesses for
get/set pageblock bitmaps"), pageblock bitmap is accessed with word-based
access. This operation could be simplified a little.

Intuitively, if we want to get a bit range [start_idx, end_idx] in a word,
we can do like this:

mask = (1 << (end_bitidx - start_bitidx + 1)) - 1;
ret = (word >> start_idx) & mask;

And also if we want to set a bit range [start_idx, end_idx] with flags, we
can do the same by just shift start_bitidx.

By doing so we reduce some instructions for these two helper functions:

Before Patched
set_pfnblock_flags_mask 209 198(-5%)
get_pfnblock_flags_mask 101 87(-13%)

Since the syntax is changed a little, we need to check the whole 4-bit
migrate_type instead of part of it.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-3-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 399b795b 07-Aug-2020 Wei Yang <richard.weiyang@linux.alibaba.com>

mm/page_alloc.c: extract the common part in pfn_to_bitidx()

The return value calculation is the same both for SPARSEMEM or not.

Just take it out.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56b9413b 07-Aug-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: remove nr_free_pagecache_pages()

nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
the name does not really help to understand what's going on. Let's
open-code it instead and add a comment.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a18e607 07-Aug-2020 David Hildenbrand <david@redhat.com>

mm: remove vm_total_pages

The global variable "vm_total_pages" is a relic from older days. There is
only a single user that reads the variable - build_all_zonelists() - and
the first thing it does is update it.

Use a local variable in build_all_zonelists() instead and remove the
global variable.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f80b08fc 07-Aug-2020 Charan Teja Reddy <charante@codeaurora.org>

mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations

When boosting is enabled, it is observed that rate of atomic order-0
allocation failures are high due to the fact that free levels in the
system are checked with ->watermark_boost offset. This is not a problem
for sleepable allocations but for atomic allocations which looks like
regression.

This problem is seen frequently on system setup of Android kernel running
on Snapdragon hardware with 4GB RAM size. When no extfrag event occurred
in the system, ->watermark_boost factor is zero, thus the watermark
configurations in the system are:

_watermark = (
[WMARK_MIN] = 1272, --> ~5MB
[WMARK_LOW] = 9067, --> ~36MB
[WMARK_HIGH] = 9385), --> ~38MB
watermark_boost = 0

After launching some memory hungry applications in Android which can cause
extfrag events in the system to an extent that ->watermark_boost can be
set to max i.e. default boost factor makes it to 150% of high watermark.

_watermark = (
[WMARK_MIN] = 1272, --> ~5MB
[WMARK_LOW] = 9067, --> ~36MB
[WMARK_HIGH] = 9385), --> ~38MB
watermark_boost = 14077, -->~57MB

With default system configuration, for an atomic order-0 allocation to
succeed, having free memory of ~2MB will suffice. But boosting makes the
min_wmark to ~61MB thus for an atomic order-0 allocation to be successful
system should have minimum of ~23MB of free memory(from calculations of
zone_watermark_ok(), min = 3/4(min/2)). But failures are observed despite
system is having ~20MB of free memory. In the testing, this is
reproducible as early as first 300secs since boot and with furtherlowram
configurations(<2GB) it is observed as early as first 150secs since boot.

These failures can be avoided by excluding the ->watermark_boost in
watermark caluculations for atomic order-0 allocations.

[akpm@linux-foundation.org: fix comment grammar, reflow comment]
[charante@codeaurora.org: fix suggested by Mel Gorman]
Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f27ce0e1 07-Aug-2020 Jaewon Kim <jaewon31.kim@samsung.com>

page_alloc: consider highatomic reserve in watermark fast

zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages"). The commit
simply checks if free pages is bigger than watermark without additional
calculation such like reducing watermark.

It considered free cma pages but it did not consider highatomic reserved.
This may incur exhaustion of free pages except high order atomic free
pages.

Assume that reserved_highatomic pageblock is bigger than watermark min,
and there are only few free pages except high order atomic free. Because
zone_watermark_fast passes the allocation without considering high order
atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume
all the free pages. Then finally order-0 atomic allocation may fail on
allocation.

This means watermark min is not protected against non-atomic allocation.
The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed.
Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also
can be failed.

To avoid the problem, zone_watermark_fast should consider highatomic
reserve. If the actual size of high atomic free is counted accurately
like cma free, we may use it. On this patch just use
nr_reserved_highatomic. Additionally introduce
__zone_watermark_unusable_free to factor out common parts between
zone_watermark_fast and __zone_watermark_ok.

This is an example of ALLOC_HARDER allocation failure using v4.19 based
kernel.

Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
Call trace:
[<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
[<ffffff8008223320>] warn_alloc+0xd8/0x12c
[<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
[<ffffff800827f6e8>] new_slab+0x128/0x604
[<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
[<ffffff800827ba00>] __kmalloc+0x2f8/0x310
[<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
[<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
[<ffffff80084ad880>] security_sid_to_context+0x10/0x18
[<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
[<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
[<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
Mem-Info:
active_anon:102061 inactive_anon:81551 isolated_anon:0
active_file:59102 inactive_file:68924 isolated_file:64
unevictable:611 dirty:63 writeback:0 unstable:0
slab_reclaimable:13324 slab_unreclaimable:44354
mapped:83015 shmem:4858 pagetables:26316 bounce:0
free:2727 free_pcp:1035 free_cma:178
Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
lowmem_reserve[]: 0 0
Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
138826 total pagecache pages
5460 pages in swap cache
Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14
based kernel.

kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
kswapd0 cpuset=/ mems_allowed=0
CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
Call trace:
[<0000000000000000>] dump_backtrace+0x0/0x248
[<0000000000000000>] show_stack+0x18/0x20
[<0000000000000000>] __dump_stack+0x20/0x28
[<0000000000000000>] dump_stack+0x68/0x90
[<0000000000000000>] warn_alloc+0x104/0x198
[<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
[<0000000000000000>] zs_malloc+0x148/0x3d0
[<0000000000000000>] zram_bvec_rw+0x410/0x798
[<0000000000000000>] zram_rw_page+0x88/0xdc
[<0000000000000000>] bdev_write_page+0x70/0xbc
[<0000000000000000>] __swap_writepage+0x58/0x37c
[<0000000000000000>] swap_writepage+0x40/0x4c
[<0000000000000000>] shrink_page_list+0xc30/0xf48
[<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
[<0000000000000000>] shrink_node_memcg+0x23c/0x618
[<0000000000000000>] shrink_node+0x1c8/0x304
[<0000000000000000>] kswapd+0x680/0x7c4
[<0000000000000000>] kthread+0x110/0x120
[<0000000000000000>] ret_from_fork+0x10/0x18
Mem-Info:
active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0
Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB

This is trace log which shows GFP_HIGHUSER consumes free pages right
before ALLOC_NO_WATERMARKS.

<...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
<...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

[jaewon31.kim@samsung.com: remove redundant code for high-order]
Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com

Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# deba0487 07-Aug-2020 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: use unlikely() in task_capc()

Hugh noted that task_capc() could use unlikely(), as most of the time
there is no capture in progress and we are in page freeing hot path.
Indeed adding unlikely() produces assembly that better matches the
assumption and moves all the tests away from the hot path.

I have also noticed that we don't need to test for cc->direct_compaction
as the only place we set current->task_capture is compact_zone_order()
which also always sets cc->direct_compaction true.

Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@googlecom>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c89ab04f 07-Aug-2020 Mike Rapoport <rppt@kernel.org>

mm/sparse: cleanup the code surrounding memory_present()

After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
functions that call memory_present() for each region in memblock.memory:
sparse_memory_present_with_active_regions() and membocks_present().

Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in the most cases they are called
one after the other.

Mark the regions from memblock.memory as present during sparce_init() by
making sparse_init() call memblocks_present(), make memblocks_present()
and memory_present() functions static and remove redundant
sparse_memory_present_with_active_regions() function.

Also remove no longer required HAVE_MEMORY_PRESENT configuration option.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 991e7673 07-Aug-2020 Shakeel Butt <shakeelb@google.com>

mm: memcontrol: account kernel stack per node

Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d42f3245 07-Aug-2020 Roman Gushchin <guro@fb.com>

mm: memcg: convert vmstat slab counters to bytes

In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes. This scheme may look weird, but
only for now. As soon as slab pages will be shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages. However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects in the account.
Keeping global and node counters in pages helps to avoid additional
overhead.

The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f649ab7 03-Jun-2020 Kees Cook <keescook@chromium.org>

treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>


# 8beeae86 03-Jul-2020 Joel Savitz <jsavitz@redhat.com>

mm/page_alloc: fix documentation error

When I increased the upper bound of the min_free_kbytes value in
ee8eb9a5fe863 ("mm/page_alloc: increase default min_free_kbytes bound") I
forgot to tweak the above comment to reflect the new value. This patch
fixes that mistake.

Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Fabrizio D'Angelo <fdangelo@redhat.com>
Link: http://lkml.kernel.org/r/20200624221236.29560-1-jsavitz@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a477e1a 07-Jun-2020 Vlastimil Babka <vbabka@suse.cz>

kernel/sysctl: support handling command line aliases

We can now handle sysctl parameters on kernel command line, but
historically some parameters introduced their own command line
equivalent, which we don't want to remove for compatibility reasons.

We can, however, convert them to the generic infrastructure with a table
translating the legacy command line parameters to their sysctl names,
and removing the one-off param handlers.

This patch adds the support and makes the first conversion to
demonstrate it, on the (deprecated) numa_zonelist_order parameter.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa218795 07-May-2020 David Hildenbrand <david@redhat.com>

mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

virtio-mem wants to allow to offline memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially, to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's allow to do that by allowing to isolate any PageOffline() page
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that he is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (like if they were free, however they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count and make offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
the memory block has to be handled by hooking into onlining code
(online_page_callback_t), resetting the page PageOffline() and
not giving them to the buddy.

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 255f5985 07-May-2020 David Hildenbrand <david@redhat.com>

virtio-mem: Paravirtualized memory hotunplug part 2

We also want to unplug online memory (contained in online memory blocks
and, therefore, managed by the buddy), and eventually replug it later.

When requested to unplug memory, we use alloc_contig_range() to allocate
subblocks in online memory blocks (so we are the owner) and send them to
our hypervisor. When requested to plug memory, we can replug such memory
using free_contig_range() after asking our hypervisor.

We also want to mark all allocated pages PG_offline, so nobody will
touch them. To differentiate pages that were never onlined when
onlining the memory block from pages allocated via alloc_contig_range(), we
use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
online the pages for the first time or use free_contig_range().

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might completely be
fragmented with unmovable data, such that no subblock can get unplugged.

We are not touching the ZONE_MOVABLE. If memory is onlined to the
ZONE_MOVABLE, it can only get unplugged after that memory was offlined
manually by user space. In normal operation, virtio-mem memory is
suggested to be onlined to ZONE_NORMAL. In the future, we will try to
make unplug more likely to succeed.

Add a module parameter to control if online memory shall be touched.

As we want to access alloc_contig_range()/free_contig_range() from
kernel module context, export the symbols.

Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
are on the same node, in the same zone, and contain no holes.

Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 730ec8c0 03-Jun-2020 Maninder Singh <maninder1.s@samsung.com>

mm/vmscan.c: change prototype for shrink_page_list

commit 3c710c1ad11b ("mm, vmscan extract shrink_page_list reclaim counters
into a struct") changed data type for the function, so changing return
type for funciton and its caller.

Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 633bf2fe 03-Jun-2020 Chen Tao <chentao107@huawei.com>

mm/page_alloc.c: add missing newline

Add missing line breaks on pr_warn().

Signed-off-by: Chen Tao <chentao107@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200603063547.235825-1-chentao107@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ecd09650 03-Jun-2020 Daniel Jordan <daniel.m.jordan@oracle.com>

mm: make deferred init's max threads arch-specific

Using padata during deferred init has only been tested on x86, so for now
limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4443149 03-Jun-2020 Daniel Jordan <daniel.m.jordan@oracle.com>

mm: parallelize deferred_init_memmap()

Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and allows
spinning up short-lived VMs as needed without having to leave them
running. It also benefits bare metal machines hosting VMs that are
sensitive to downtime. In projects such as VMM Fast Restart[1], where
guest state is preserved across kexec reboot, it helps prevent application
and network timeouts in the guests.

Multithread to take full advantage of system memory bandwidth.

The maximum number of threads is capped at the number of CPUs on the node
because speedups always improve with additional threads on every system
tested, and at this phase of boot, the system is otherwise idle and
waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid accessing
uninitialized buddy pages, though max order alignment is enough for the
latter.

The minimum chunk size is also a section. There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this is
the smallest size that the alignment allows.

The time (milliseconds) is the slowest node to initialize since boot
blocks until all nodes finish. intel_pstate is loaded in active mode
without hwp and with turbo enabled, and intel_idle is active as well.

Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
2 nodes * 26 cores * 2 threads = 104 CPUs
384G/node = 768G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 4089.7 ( 8.1) -- 1785.7 ( 7.6)
2% ( 1) 1.7% 4019.3 ( 1.5) 3.8% 1717.7 ( 11.8)
12% ( 6) 34.9% 2662.7 ( 2.9) 79.9% 359.3 ( 0.6)
25% ( 13) 39.9% 2459.0 ( 3.6) 91.2% 157.0 ( 0.0)
37% ( 19) 39.2% 2485.0 ( 29.7) 90.4% 172.0 ( 28.6)
50% ( 26) 39.3% 2482.7 ( 25.7) 90.3% 173.7 ( 30.0)
75% ( 39) 39.0% 2495.7 ( 5.5) 89.4% 190.0 ( 1.0)
100% ( 52) 40.2% 2443.7 ( 3.8) 92.3% 138.0 ( 1.0)

Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
1 node * 16 cores * 2 threads = 32 CPUs
192G/node = 192G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 1988.7 ( 9.6) -- 1096.0 ( 11.5)
3% ( 1) 1.1% 1967.0 ( 17.6) 0.3% 1092.7 ( 11.0)
12% ( 4) 41.1% 1170.3 ( 14.2) 73.8% 287.0 ( 3.6)
25% ( 8) 47.1% 1052.7 ( 21.9) 83.9% 177.0 ( 13.5)
38% ( 12) 48.9% 1016.3 ( 12.1) 86.8% 144.7 ( 1.5)
50% ( 16) 48.9% 1015.7 ( 8.1) 87.8% 134.0 ( 4.4)
75% ( 24) 49.1% 1012.3 ( 3.1) 88.1% 130.3 ( 2.3)
100% ( 32) 49.5% 1004.0 ( 5.3) 88.5% 125.7 ( 2.1)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
2 nodes * 18 cores * 2 threads = 72 CPUs
128G/node = 256G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 1680.0 ( 4.6) -- 627.0 ( 4.0)
3% ( 1) 0.3% 1675.7 ( 4.5) -0.2% 628.0 ( 3.6)
11% ( 4) 25.6% 1250.7 ( 2.1) 67.9% 201.0 ( 0.0)
25% ( 9) 30.7% 1164.0 ( 17.3) 81.8% 114.3 ( 17.7)
36% ( 13) 31.4% 1152.7 ( 10.8) 84.0% 100.3 ( 17.9)
50% ( 18) 31.5% 1150.7 ( 9.3) 83.9% 101.0 ( 14.1)
75% ( 27) 31.7% 1148.0 ( 5.6) 84.5% 97.3 ( 6.4)
100% ( 36) 32.0% 1142.3 ( 4.0) 85.6% 90.0 ( 1.0)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 8 cores * 2 threads = 16 CPUs
64G/node = 64G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 1029.3 ( 25.1) -- 240.7 ( 1.5)
6% ( 1) -0.6% 1036.0 ( 7.8) -2.2% 246.0 ( 0.0)
12% ( 2) 11.8% 907.7 ( 8.6) 44.7% 133.0 ( 1.0)
25% ( 4) 13.9% 886.0 ( 10.6) 62.6% 90.0 ( 6.0)
38% ( 6) 17.8% 845.7 ( 14.2) 69.1% 74.3 ( 3.8)
50% ( 8) 16.8% 856.0 ( 22.1) 72.9% 65.3 ( 5.7)
75% ( 12) 15.4% 871.0 ( 29.2) 79.8% 48.7 ( 7.4)
100% ( 16) 21.0% 813.7 ( 21.0) 80.5% 47.0 ( 5.2)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
16G/node = 16G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 716.0 ( 14.0) -- 49.7 ( 0.6)
25% ( 1) 1.8% 703.0 ( 5.3) -4.0% 51.7 ( 0.6)
50% ( 2) 1.6% 704.7 ( 1.2) 43.0% 28.3 ( 0.6)
75% ( 3) 2.7% 696.7 ( 13.1) 49.7% 25.0 ( 0.0)
100% ( 4) 4.1% 687.0 ( 10.4) 55.7% 22.0 ( 0.0)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
14G/node = 14G memory

kernel boot deferred init
------------------------ ------------------------
node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
( 0) -- 787.7 ( 6.4) -- 122.3 ( 0.6)
25% ( 1) 0.2% 786.3 ( 10.8) -2.5% 125.3 ( 2.1)
50% ( 2) 5.9% 741.0 ( 13.9) 37.6% 76.3 ( 19.7)
75% ( 3) 8.3% 722.0 ( 19.0) 49.9% 61.3 ( 3.2)
100% ( 4) 9.3% 714.7 ( 9.5) 56.4% 53.3 ( 1.5)

On Josh's 96-CPU and 192G memory system:

Without this patch series:
[ 0.487132] node 0 initialised, 23398907 pages in 292ms
[ 0.499132] node 1 initialised, 24189223 pages in 304ms
...
[ 0.629376] Run /sbin/init as init process

With this patch series:
[ 0.231435] node 1 initialised, 24189223 pages in 32ms
[ 0.236718] node 0 initialised, 23398907 pages in 36ms

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89c7c402 03-Jun-2020 Daniel Jordan <daniel.m.jordan@oracle.com>

mm: don't track number of pages during deferred initialization

Deferred page init used to report the number of pages initialized:

node 0 initialised, 32439114 pages in 97ms

Tracking this makes the code more complicated when using multiple threads.
Given that the statistic probably has limited value, especially since a
zone grows on demand so that the page count can vary, just remove it.

The boot message now looks like

node 0 deferred pages initialised in 97ms

Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-6-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da97f2d5 03-Jun-2020 Pavel Tatashin <pasha.tatashin@soleen.com>

mm: call cond_resched() from deferred_init_memmap()

Now that deferred pages are initialized with interrupts enabled we can
replace touch_nmi_watchdog() with cond_resched(), as it was before
3a2d7fa8a3d5.

For now, we cannot do the same in deferred_grow_zone() as it is still
initializes pages with interrupts disabled.

This change fixes RCU problem described in
https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com

[ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
[ 60.475000] Sending NMI from CPU 0 to CPUs 1:
[ 1.760091] NMI backtrace for cpu 1
[ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.760091] Call Trace:
[ 1.760091] deferred_init_pages+0x8f/0xbf
[ 1.760091] deferred_init_memmap+0x184/0x29d
[ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
[ 1.760091] kthread+0x112/0x130
[ 1.760091] ? kthread_flush_work_fn+0x10/0x10
[ 1.760091] ret_from_fork+0x35/0x40
[ 89.123011] node 0 initialised, 1055935372 pages in 88650ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Yiqian Wei <yiwei@redhat.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d060856 03-Jun-2020 Pavel Tatashin <pasha.tatashin@soleen.com>

mm: initialize deferred pages with interrupts enabled

Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for long period of time, and thus incorrect time
is reported. See proposed solution and discussion here:
lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents farther improving deferred page initialization by allowing
intra-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in real world (See 3a2d7fa8a3d5).

Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate large amount of memory this early in
boot we can deal with that by growing zone (see deferred_grow_zone()) by
the needed amount before starting deferred_init_memmap() threads.

Before:
[ 1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[ 1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Yiqian Wei <yiwei@redhat.com>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 117003c3 03-Jun-2020 Daniel Jordan <daniel.m.jordan@oracle.com>

mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init

Patch series "initialize deferred pages with interrupts enabled", v4.

Keep interrupts enabled during deferred page initialization in order to
make code more modular and allow jiffies to update.

Original approach, and discussion can be found here:
http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com

This patch (of 3):

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
will run with interrupts enabled, at which point cond_resched() should be
used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second. The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: James Morris <jmorris@namei.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Yiqian Wei <yiwei@redhat.com>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae70eddd 03-Jun-2020 Anshuman Khandual <anshuman.khandual@arm.com>

mm/page_alloc: restrict and formalize compound_page_dtors[]

Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and
explicitly position them according to enum compound_dtor_id. This
improves protection against possible misalignment between
compound_page_dtors[] and enum compound_dtor_id later on.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa092591 03-Jun-2020 Charan Teja Reddy <charante@codeaurora.org>

mm, page_alloc: reset the zone->watermark_boost early

Updating the zone watermarks by any means, like min_free_kbytes,
water_mark_scale_factor etc, when ->watermark_boost is set will result in
higher low and high watermarks than the user asked.

Below are the steps to reproduce the problem on system setup of Android
kernel running on Snapdragon hardware.

1) Default settings of the system are as below:

#cat /proc/sys/vm/min_free_kbytes = 5162
#cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone Normal
min 797
low 8340
high 8539

2) Monitor the zone->watermark_boost(by adding a debug print in the
kernel) and whenever it is greater than zero value, write the same
value of min_free_kbytes obtained from step 1.

#echo 5162 > /proc/sys/vm/min_free_kbytes

3) Then read the zone watermarks in the system while the
->watermark_boost is zero. This should show the same values of
watermarks as step 1 but shown a higher values than asked.

#cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone Normal
min 797
low 21148
high 21347

These higher values are because of updating the zone watermarks using the
macro min_wmark_pages(zone) which also adds the zone->watermark_boost.

#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
z->watermark_boost)

So the steps that lead to the issue are:

1) On the extfrag event, watermarks are boosted by storing the required
value in ->watermark_boost.

2) User tries to update the zone watermarks level in the system through
min_free_kbytes or watermark_scale_factor.

3) Later, when kswapd woke up, it resets the zone->watermark_boost to
zero.

In step 2), we use the min_wmark_pages() macro to store the watermarks
in the zone structure thus the values are always offsetted by
->watermark_boost value. This can be avoided by resetting the
->watermark_boost to zero before it is used.

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b418a0f9 03-Jun-2020 Sandipan Das <sandipan@linux.ibm.com>

mm/page_alloc.c: reset numa stats for boot pagesets

Initially, the per-cpu pagesets of each zone are set to the boot pagesets.
The real pagesets are allocated later but before that happens, page
allocations do occur and the numa stats for the boot pagesets get
incremented since they are common to all zones at that point.

The real pagesets, however, are allocated for the populated zones only.
Unpopulated zones, like those associated with memory-less nodes, continue
using the boot pageset and end up skewing the numa stats of the
corresponding node.

E.g.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 4 5 6 7
node 1 size: 8131 MB
node 1 free: 6980 MB
node distances:
node 0 1
0: 10 40
1: 40 10

$ numastat
node0 node1
numa_hit 108 56495
numa_miss 0 0
numa_foreign 0 0
interleave_hit 0 4537
local_node 108 31547
other_node 0 24948

Hence, the boot pageset stats need to be cleared after the real pagesets
are allocated.

After this point, the stats of the boot pagesets do not change as page
allocations requested for a memory-less node will either fail (if
__GFP_THISNODE is used) or get fulfilled by a preferred zone of a
different node based on the fallback zonelist.

[sandipan@linux.ibm.com: v3]
Link: http://lkml.kernel.org/r/20200511170356.162531-1-sandipan@linux.ibm.com
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/9c9c2d1b15e37f6e6bf32f99e3100035e90c4ac9.1588868430.git.sandipan@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01c0bfe0 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention

Pageblock migrate type is encoded in GFP flags, just as zone_type and
zonelist.

Currently we use gfp_zone() and gfp_zonelist() to extract related
information, it would be proper to use the same naming convention for
migrate type.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0ddf49b 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: use NODE_MASK_NONE in build_zonelists()

Slightly simplify the code by initializing user_mask with NODE_MASK_NONE,
instead of later calling nodes_clear(). This saves a line of code.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200330220840.21228-1-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97a225e6 03-Jun-2020 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: integrate classzone_idx and high_zoneidx

classzone_idx is just different name for high_zoneidx now. So, integrate
them and add some comment to struct alloc_context in order to reduce
future confusion about the meaning of this variable.

The accessor, ac_classzone_idx() is also removed since it isn't needed
after integration.

In addition to integration, this patch also renames high_zoneidx to
highest_zoneidx since it represents more precise meaning.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ye Xiaolong <xiaolong.ye@intel.com>
Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f6366156 03-Jun-2020 Baoquan He <bhe@redhat.com>

mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty

When requesting memory allocation from a specific zone is not satisfied,
it will fall to lower zone to try allocating memory. In this case, lower
zone's ->lowmem_reserve[] will help protect its own memory resource. The
higher the relevant ->lowmem_reserve[] is, the harder the upper zone can
get memory from this lower zone.

However, this protection mechanism should be applied to populated zone,
but not an empty zone. So filling ->lowmem_reserve[] for empty zone is
not necessary, and may mislead people that it's valid data in that zone.

Node 2, zone DMA
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 1024, 1024)
Node 2, zone DMA32
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 1024, 1024)
Node 2, zone Normal
per-node stats
nr_inactive_anon 0
nr_active_anon 143
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_slab_reclaimable 45
nr_slab_unreclaimable 254

Here clear out zone->lowmem_reserve[] if zone is empty.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200402140113.3696-3-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86aaf255 03-Jun-2020 Baoquan He <bhe@redhat.com>

mm/page_alloc.c: only tune sysctl_lowmem_reserve_ratio value once when changing it

Patch series "improvements about lowmem_reserve and /proc/zoneinfo", v2.

This patch (of 3):

When people write to /proc/sys/vm/lowmem_reserve_ratio to change
sysctl_lowmem_reserve_ratio[], setup_per_zone_lowmem_reserve() is called
to recalculate all ->lowmem_reserve[] for each zone of all nodes as below:

static void setup_per_zone_lowmem_reserve(void)
{
...
for_each_online_pgdat(pgdat) {
for (j = 0; j < MAX_NR_ZONES; j++) {
...
while (idx) {
...
if (sysctl_lowmem_reserve_ratio[idx] < 1) {
sysctl_lowmem_reserve_ratio[idx] = 0;
lower_zone->lowmem_reserve[j] = 0;
} else {
...
}
}
}
}

Meanwhile, here, sysctl_lowmem_reserve_ratio[idx] will be tuned if its
value is smaller than '1'. As we know, sysctl_lowmem_reserve_ratio[] is
set for zone without regarding to which node it belongs to. That means
the tuning will be done on all nodes, even though it has been done in the
first node.

And the tuning will be done too even when init_per_zone_wmark_min() calls
setup_per_zone_lowmem_reserve(), where actually nobody tries to change
sysctl_lowmem_reserve_ratio[].

So now move the tuning into lowmem_reserve_ratio_sysctl_handler(), to make
code logic more reasonable.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200402140113.3696-1-bhe@redhat.com
Link: http://lkml.kernel.org/r/20200402140113.3696-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ca7be24 03-Jun-2020 Baoquan He <bhe@redhat.com>

mm/page_alloc.c: remove unused free_bootmem_with_active_regions

Since commit 397dc00e249ec64e10 ("mips: sgi-ip27: switch from DISCONTIGMEM
to SPARSEMEM"), the last caller of free_bootmem_with_active_regions() was
gone. Now no user calls it any more.

Let's remove it.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200402143455.5145-1-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16867664 03-Jun-2020 Roman Gushchin <guro@fb.com>

mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations

Currently a cma area is barely used by the page allocator because it's
used only as a fallback from movable, however kswapd tries hard to make
sure that the fallback path isn't used.

This results in a system evicting memory and pushing data into swap, while
lots of CMA memory is still available. This happens despite the fact that
alloc_contig_range is perfectly capable of moving any movable allocations
out of the way of an allocation.

To effectively use the cma area let's alter the rules: if the zone has
more free cma pages than the half of total free pages in the zone, use cma
pageblocks first and fallback to movable blocks in the case of failure.

[guro@fb.com: ifdef the cma-specific code]
Link: http://lkml.kernel.org/r/20200311225832.GA178154@carbon.DHCP.thefacebook.com
Co-developed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200306150102.3e77354b@imladris.surriel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58b7f119 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: extract check_[new|free]_page_bad() common part to page_bad_reason()

We share similar code in check_[new|free]_page_bad() to get the page's bad
reason.

Let's extract it and reduce code duplication.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200411220357.9636-6-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 534fe5e3 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: rename free_pages_check() to check_free_page()

free_pages_check() is the counterpart of check_new_page(). Rename it to
use the same naming convention.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d0c48a2 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad()

free_pages_check_bad() is the counterpart of check_new_page_bad(). Rename
it to use the same naming convention.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200411220357.9636-4-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82a3241a 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: bad_flags is not necessary for bad_page()

After commit 5b57b8f22709 ("mm/debug.c: always print flags in
dump_page()"), page->flags is always printed for a bad page. It is not
necessary to have bad_flags any more.

Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200411220357.9636-3-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 833d8a42 03-Jun-2020 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoison

Patch series "mm/page_alloc.c: cleanup on check page", v3.

This patchset does some cleanup related to check page.

1. Remove unnecessary bad_reason assignment
2. Remove bad_flags to bad_page()
3. Rename function for naming convention
4. Extract common part to check page

Thanks for suggestions from David Rientjes and Anshuman Khandual.

This patch (of 5):

Since function returns directly, bad_[reason|flags] is not used any where.
And move this to the first.

This is a following cleanup for commit e570f56cccd21 ("mm:
check_new_page_bad() directly returns in __PG_HWPOISON case")

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200411220357.9636-2-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a1b25fe 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: simplify find_min_pfn_with_active_regions()

find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with nid
parameter set to MAX_NUMNODES. This makes the find_min_pfn_for_node()
traverse all memblock memory regions although the first PFN in the system
can be easily found with memblock_start_of_DRAM().

Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and drop
now unused find_min_pfn_for_node().

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-21-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 854e8848 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: clean up free_area_init_node() and its helpers

free_area_init_node() now always uses memblock info and the zone PFN
limits so it does not need the backwards compatibility functions to
calculate the zone spanned and absent pages. The removal of the compat_
versions of zone_{abscent,spanned}_pages_in_node() in turn, makes
zone_size and zhole_size parameters unused.

The node_start_pfn is determined by get_pfn_range_for_nid(), so there is
no need to pass it to free_area_init_node().

As a result, the only required parameter to free_area_init_node() is the
node ID, all the rest are removed along with no longer used
compat_zone_{abscent,spanned}_pages_in_node() helpers.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-20-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc9331a1 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: rename free_area_init_node() to free_area_init_memoryless_node()

free_area_init_node() is only used by x86 to initialize a memory-less
nodes. Make its name reflect this and drop all the function parameters
except node ID as they are anyway zero.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51930df5 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: free_area_init: allow defining max_zone_pfn in descending order

Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the
ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it
is sorted in descending order allows using free_area_init() on such
architectures.

Add top -> down traversal of max_zone_pfn array in free_area_init() and
use the latter in ARC node/zone initialization.

[rppt@kernel.org: ARC fix]
Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
[rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
[akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acd3f5c4 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES

The memmap_init() function was made to iterate over memblock regions and
as the result the early_pfn_in_nid() function became obsolete. Since
CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real
implementation of early_pfn_in_nid(), it is also not needed anymore.

Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES.

Co-developed-by: Hoan Tran <Hoan@os.amperecomputing.com>
Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73a6e474 03-Jun-2020 Baoquan He <bhe@redhat.com>

mm: memmap_init: iterate over memblock regions rather that check each PFN

When called during boot the memmap_init_zone() function checks if each PFN
is valid and actually belongs to the node being initialized using
early_pfn_valid() and early_pfn_in_nid().

Each such check may cost up to O(log(n)) where n is the number of memory
banks, so for large amount of memory overall time spent in early_pfn*()
becomes substantial.

Since the information is anyway present in memblock, we can iterate over
memblock memory regions in memmap_init() and only call memmap_init_zone()
for PFN ranges that are know to be valid and in the appropriate node.

[cai@lca.pw: fix a compilation warning from Clang]
Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw
[bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()]
Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9691a071 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: use free_area_init() instead of free_area_init_nodes()

free_area_init() has effectively became a wrapper for
free_area_init_nodes() and there is no point of keeping it. Still
free_area_init() name is shorter and more general as it does not imply
necessity to initialize multiple nodes.

Rename free_area_init_nodes() to free_area_init(), update the callers and
drop old version of free_area_init().

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa3354e4 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: free_area_init: use maximal zone PFNs rather than zone sizes

Currently, architectures that use free_area_init() to initialize memory
map and node and zone structures need to calculate zone and hole sizes.
We can use free_area_init_nodes() instead and let it detect the zone
boundaries while the architectures will only have to supply the possible
limits for the zones.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-5-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f08a302 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option

CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
nodes and zones structures between the systems that have region to node
mapping in memblock and those that don't.

Currently all the NUMA architectures enable this option and for the
non-NUMA systems we can presume that all the memory belongs to node 0 and
therefore the compile time configuration option is not required.

The remaining few architectures that use DISCONTIGMEM without NUMA are
easily updated to use memblock_add_node() instead of memblock_add() and
thus have proper correspondence of memblock regions to NUMA nodes.

Still, free_area_init_node() must have a backward compatible version
because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
different. Once all the architectures will use the new semantics, the
entire compatibility layer can be dropped.

To avoid addition of extra run time memory to store node id for
architectures that keep memblock but have only a single node, the node id
field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
the corresponding accessors presume that in those cases it is always 0.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6f24fbd3 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: make early_pfn_to_nid() and related defintions close to each other

early_pfn_to_nid() and its helper __early_pfn_to_nid() are spread around
include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.

Drop unused stub for __early_pfn_to_nid() and move its actual generic
implementation close to its users.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d622abf7 03-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: memblock: replace dereferences of memblock_region.nid with API calls

Patch series "mm: rework free_area_init*() funcitons".

After the discussion [1] about removal of CONFIG_NODES_SPAN_OTHER_NODES
and CONFIG_HAVE_MEMBLOCK_NODE_MAP options, I took it a bit further and
updated the node/zone initialization.

Since all architectures have memblock, it is possible to use only the
newer version of free_area_init_node() that calculates the zone and node
boundaries based on memblock node mapping and architectural limits on
possible zone PFNs.

The architectures that still determined zone and hole sizes can be
switched to the generic code and the old code that took those zone and
hole sizes can be simply removed.

And, since it all started from the removal of
CONFIG_NODES_SPAN_OTHER_NODES, the memmap_init() is now updated to iterate
over memblocks and so it does not need to perform early_pfn_to_nid() query
for every PFN.

[1] https://lore.kernel.org/lkml/1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com

This patch (of 21):

There are several places in the code that directly dereference
memblock_region.nid despite this field being defined only when
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y.

Replace these with calls to memblock_get_region_nid() to improve code
robustness and to avoid possible breakage when
CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200412194859.12663-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88dca4ca 01-Jun-2020 Christoph Hellwig <hch@lst.de>

mm: remove the pgprot argument to __vmalloc

The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d92890b 01-Jun-2020 NeilBrown <neilb@suse.de>

mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead

After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput. With this
patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. static trace points and warning printks which mentioned this
counter no longer report it.

[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Michal Hocko <mhocko@suse.com> [mm]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 628d06a4 27-Apr-2020 Sami Tolvanen <samitolvanen@google.com>

scs: Add page accounting for shadow call stack allocations

This change adds accounting for the memory allocated for shadow stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>


# 14f69140 07-May-2020 Henry Willard <henry.willard@oracle.com>

mm: limit boost_watermark on small zones

Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
external fragmentation event occurs") adds a boost_watermark() function
which increases the min watermark in a zone by at least
pageblock_nr_pages or the number of pages in a page block.

On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
512M. It does this regardless of the number of managed pages managed in
the zone or the likelihood of success.

This can put the zone immediately under water in terms of allocating
pages from the zone, and can cause a small machine to fail immediately
due to OoM. Unlike set_recommended_min_free_kbytes(), which
substantially increases min_free_kbytes and is tied to THP,
boost_watermark() can be called even if THP is not active.

The problem is most likely to appear on architectures such as Arm64
where pageblock_nr_pages is very large.

It is desirable to run the kdump capture kernel in as small a space as
possible to avoid wasting memory. In some architectures, such as Arm64,
there are restrictions on where the capture kernel can run, and
therefore, the space available. A capture kernel running in 768M can
fail due to OoM immediately after boost_watermark() sets the min in zone
DMA32, where most of the memory is, to 512M. It fails even though there
is over 500M of free memory. With boost_watermark() suppressed, the
capture kernel can run successfully in 448M.

This patch limits boost_watermark() to boosting a zone's min watermark
only when there are enough pages that the boost will produce positive
results. In this case that is estimated to be four times as many pages
as pageblock_nr_pages.

Mel said:

: There is no harm in marking it stable. Clearly it does not happen very
: often but it's not impossible. 32-bit x86 is a lot less common now
: which would previously have been vulnerable to triggering this easily.
: ppc64 has a larger base page size but typically only has one zone.
: arm64 is likely the most vulnerable, particularly when CMA is
: configured with a small movable zone.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e84fe99b 07-May-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()

Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
e.g., while booting up.

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
RIP: __pageblock_pfn_to_page+0x134/0x1c0
Call Trace:
set_zone_contiguous+0x56/0x70
page_alloc_init_late+0x166/0x176
kernel_init_freeable+0xfa/0x255
kernel_init+0xa/0x106
ret_from_fork+0x35/0x40

The issue becomes visible when having a lot of memory (e.g., 4TB)
assigned to a single NUMA node - a system that can easily be created
using QEMU. Inside VMs on a hypervisor with quite some memory
overcommit, this is fairly easy to trigger.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 26363af5 24-Apr-2020 Christoph Hellwig <hch@lst.de>

mm: remove watermark_boost_factor_sysctl_handler

watermark_boost_factor_sysctl_handler is just a pointless wrapper for
proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8b885f53 10-Apr-2020 Jason Yan <yanaijie@huawei.com>

mm/page_alloc: make pcpu_drain_mutex and pcpu_drain static

Fix the following sparse warning:

mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static?
mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e6a0a7ad 10-Apr-2020 Randy Dunlap <rdunlap@infradead.org>

mm/page_alloc.c: fix kernel-doc warning

Add description of function parameter 'mt' to fix kernel-doc warning:

mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36e66c55 06-Apr-2020 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: introduce Reported pages

In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned. To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy page
type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function. As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.

The process for reporting pages is fairly simple. Once we free a page
that meets the minimum order for page reporting we will schedule a worker
thread to start 2s or more in the future. That worker thread will begin
working from the lowest supported page reporting order up to MAX_ORDER - 1
pulling unreported pages from the free list and storing them in the
scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages. To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.

It will then call a reporting function providing information on how many
entries are in the scatterlist. Once the function completes it will
return the pages to the free area from which they were allocated and start
over pulling more pages from the free areas until there are no longer
enough pages to report on to keep the worker busy, or we have processed as
many pages as were contained in the free area when we started processing
the list.

The worker thread will work in a round-robin fashion making its way though
each zone requesting reporting, and through each reportable free list
within that zone. Once all free areas within the zone have been processed
it will check to see if there have been any requests for reporting while
it was processing. If so it will reschedule the worker thread to start up
again in roughly 2s and exit.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 624f58d8 06-Apr-2020 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: add function __putback_isolated_page

There are cases where we would benefit from avoiding having to go through
the allocation and free cycle to return an isolated page.

Examples for this might include page poisoning in which we isolate a page
and then put it back in the free list without ever having actually
allocated it.

This will enable us to also avoid notifiers for the future free page
reporting which will need to avoid retriggering page reporting when
returning pages that have been reported on.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6ab01363 06-Apr-2020 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: use zone and order instead of free area in free_list manipulators

In order to enable the use of the zone from the list manipulator functions
I will need access to the zone pointer. As it turns out most of the
accessors were always just being directly passed &zone->free_area[order]
anyway so it would make sense to just fold that into the function itself
and pass the zone and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define
the list manipulation functions. Since the functions are only used in the
file mm/page_alloc.c we can just move them there to reduce noise in the
header.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2129f24 06-Apr-2020 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: adjust shuffle code to allow for future coalescing

Patch series "mm / virtio: Provide support for free page reporting", v17.

This series provides an asynchronous means of reporting free guest pages
to a hypervisor so that the memory associated with those pages can be
dropped and reused by other processes and/or guests on the host. Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.

When enabled we will be performing a scan of free memory every 2 seconds
while pages of sufficiently high order are being freed. In each pass at
least one sixteenth of each free list will be reported. By doing this we
avoid racing against other threads that may be causing a high amount of
memory churn.

The lowest page order currently scanned when reporting pages is
pageblock_order so that this feature will not interfere with the use of
Transparent Huge Pages in the case of virtualization.

Currently this is only in use by virtio-balloon however there is the hope
that at some point in the future other hypervisors might be able to make
use of it. In the virtio-balloon/QEMU implementation the hypervisor is
currently using MADV_DONTNEED to indicate to the host kernel that the page
is currently free. It will be zeroed and faulted back into the guest the
next time the page is accessed.

To track if a page is reported or not the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages. We walk though the free list
isolating pages and adding them to the scatterlist until we either
encounter the end of the list or have processed at least one sixteenth of
the pages that were listed in nr_free prior to us starting. If we fill
the scatterlist before we reach the end of the list we rotate the list so
that the first unreported page we encounter is moved to the head of the
list as that is where we will resume after we have freed the reported
pages back into the tail of the list.

Below are the results from various benchmarks. I primarily focused on two
tests. The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP. I did this as it allows for better visibility into different parts
of the memory subsystem. The guest is running with 32G for RAM on one
node of a E5-2630 v3. The host has had some features such as CPU turbo
disabled in the BIOS.

Test page_fault1 (THP) page_fault2
Name tasks Process Iter STDEV Process Iter STDEV
Baseline 1 1012402.50 0.14% 361855.25 0.81%
16 8827457.25 0.09% 3282347.00 0.34%

Patches Applied 1 1007897.00 0.23% 361887.00 0.26%
16 8784741.75 0.39% 3240669.25 0.48%

Patches Enabled 1 1010227.50 0.39% 359749.25 0.56%
16 8756219.00 0.24% 3226608.75 0.97%

Patches Enabled 1 1050982.00 4.26% 357966.25 0.14%
page shuffle 16 8672601.25 0.49% 3223177.75 0.40%

Patches enabled 1 1003238.00 0.22% 360211.00 0.22%
shuffle w/ RFC 16 8767010.50 0.32% 3199874.00 0.71%

The results above are for a baseline with a linux-next-20191219 kernel,
that kernel with this patch set applied but page reporting disabled in
virtio-balloon, the patches applied and page reporting fully enabled, the
patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled and an RFC patch that makes used of MADV_FREE in
QEMU. These results include the deviation seen between the average value
reported here versus the high and/or low value. I observed that during
the test memory usage for the first three tests never dropped whereas with
the patches fully enabled the VM would drop to using only a few GB of the
host's memory when switching from memhog to page fault tests.

Any of the overhead visible with this patch set enabled seems due to page
faults caused by accessing the reported pages and the host zeroing the
page before giving it back to the guest. This overhead is much more
visible when using THP than with standard 4K pages. In addition page
shuffling seemed to increase the amount of faults generated due to an
increase in memory churn. The overehad is reduced when using MADV_FREE as
we can avoid the extra zeroing of the pages when they are reintroduced to
the host, as can be seen when the RFC is applied with shuffling enabled.

The overall guest size is kept fairly small to only a few GB while the
test is running. If the host memory were oversubscribed this patch set
should result in a performance improvement as swapping memory in the host
can be avoided.

A brief history on the background of free page reporting can be found at:
https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/

This patch (of 9):

Move the head/tail adding logic out of the shuffle code and into the
__free_one_page function since ultimately that is where it is really
needed anyway. By doing this we should be able to reduce the overhead and
can consolidate all of the list addition bits in one spot.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1da2f328 01-Apr-2020 Rik van Riel <riel@surriel.com>

mm,thp,compaction,cma: allow THP migration for CMA allocations

The code to implement THP migrations already exists, and the code for CMA
to clear out a region of memory already exists.

Only a few small tweaks are needed to allow CMA to move THP memory when
attempting an allocation from alloc_contig_range.

With these changes, migrating THPs from a CMA area works when allocating a
1GB hugepage from CMA memory.

[riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil]
Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b06eda09 01-Apr-2020 Rik van Riel <riel@surriel.com>

mm,compaction,cma: add alloc_contig flag to compact_control

Patch series "fix THP migration for CMA allocations", v2.

Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
CMA memory blocks. Transparent huge pages also have most of the
infrastructure in place to allow migration.

However, a few pieces were missing, causing THP migration to fail when
attempting to use CMA to allocate 1GB hugepages.

With these patches in place, THP migration from CMA blocks seems to work,
both for anonymous THPs and for tmpfs/shmem THPs.

This patch (of 2):

Add information to struct compact_control to indicate that the allocator
would really like to clear out this specific part of memory, used by for
example CMA.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe925c0c 01-Apr-2020 chenqiwu <chenqiwu@xiaomi.com>

mm/page_alloc: simplify page_is_buddy() for better code readability

Simplify page_is_buddy() to reduce the redundant code for better code
readability.

Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97ce86f9 01-Apr-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm/page_alloc.c: micro-optimisation Remove unnecessary branch

Previously if branch condition was false, the assignment was not executed.
The assignment can be safely executed even when the condition is false
and it is not incorrect as it assigns the value of 'nodemask' to
'ac.nodemask' which already has the same value.

So as the assignment can be executed unconditionally, the branch can be
removed.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76089d00 01-Apr-2020 chenqiwu <chenqiwu@xiaomi.com>

mm/page_alloc.c: use free_area_empty() instead of open-coding

Use free_area_empty() API to replace list_empty() for better code
readability.

Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/1583674354-7713-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 736838e9 01-Apr-2020 Mateusz Nosek <mateusznosek0@gmail.com>

mm, pagealloc: micro-optimisation: save two branches on hot page allocation path

This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int).

Thanks to that code like:

if (gfp_mask & __GFP_KSWAPD_RECLAIM)
alloc_flags |= ALLOC_KSWAPD;

can be changed to:

alloc_flags |= (__force int) (gfp_mask &__GFP_KSWAPD_RECLAIM);

Thanks to this one branch less is generated in the assembly.

In case of ALLOC_KSWAPD flag two branches are saved, first one in code
that always executes in the beginning of page allocation and the second
one in loop in page allocator slowpath.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee8eb9a5 01-Apr-2020 Joel Savitz <jsavitz@redhat.com>

mm/page_alloc: increase default min_free_kbytes bound

Currently, the vm.min_free_kbytes sysctl value is capped at a hardcoded
64M in init_per_zone_wmark_min (unless it is overridden by khugepaged
initialization).

This value has not been modified since 2005, and enterprise-grade systems
now frequently have hundreds of GB of RAM and multiple 10, 40, or even 100
GB NICs. We have seen page allocation failures on heavily loaded systems
related to NIC drivers. These issues were resolved by an increase to
vm.min_free_kbytes.

This patch increases the hardcoded value by a factor of 4 as a temporary
solution.

Further work to make the calculation of vm.min_free_kbytes more consistent
throughout the kernel would be desirable.

As an example of the inconsistency of the current method, this value is
recalculated by init_per_zone_wmark_min() in the case of memory hotplug
which will override the value set by set_recommended_min_free_kbytes()
called during khugepaged initialization even if khugepaged remains
enabled, however an on/off toggle of khugepaged will then recalculate and
set the value via set_recommended_min_free_kbytes().

Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rafael Aquini <aquini@redhat.com>
Link: http://lkml.kernel.org/r/20200220150103.5183-1-jsavitz@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4b00eab 01-Apr-2020 Roman Gushchin <guro@fb.com>

mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()

Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:

1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg

2) set or clear the PageKmemcg flag

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47e29d32 01-Apr-2020 John Hubbard <jhubbard@nvidia.com>

mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.

This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.

A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 667c7901 01-Apr-2020 Vlastimil Babka <vbabka@suse.cz>

revert "topology: add support for node_to_mem_node() to determine the fallback node"

This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2.

The function node_to_mem_node() was introduced by that commit for use in SLUB
on systems with memoryless nodes, but it turned out to be unreliable on some
architectures/configurations and a simpler solution exists than fixing it up.

Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and
memory leaks") removed the only user of node_to_mem_node() and we can
revert the commit that introduced the function.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20200320115533.9604-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f8d75c1 03-Feb-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/memmap_init: update variable name in memmap_init_zone

Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.

This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect the
ZONE_DEVICE as contiguous).

We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of DIMMs
at zone boundaries.

--------------------------------------------------------------------------

Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.

Example:

:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 98304
present 65536
managed 65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 32768
present 32768
managed 32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0

This patch (of 6):

The third argument is actually number of pages. Change the variable name
from size to nr_pages to indicate this better.

No functional change in this patch.

Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4c605881 03-Feb-2020 David Hildenbrand <david@redhat.com>

mm: factor out next_present_section_nr()

Let's move it to the header and use the shorter variant from
mm/page_alloc.c (the original one will also check
"__highest_present_section_nr + 1", which is not necessary). While at
it, make the section_nr in next_pfn() const.

In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
exceed __highest_present_section_nr, which doesn't make a difference in
the caller as it is big enough (>= all sane end_pfn).

Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 948c436e 03-Feb-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc: fix and rework pfn handling in memmap_init_zone()

Let's update the pfn manually whenever we continue the loop. This makes
the code easier to read but also less error prone (and we can directly fix
one issue).

When overlap_memmap_init() returns true, pfn is updated to
"memblock_region_memory_end_pfn(r)". So it already points at the *next*
pfn to process. Incrementing the pfn another time is wrong, we might
leave one uninitialized. I spotted this by inspecting the code, so I have
no idea if this is relevant in practise (with kernelcore=mirror).

Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b094b78 03-Feb-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc.c: initialize memmap of unavailable memory directly

Let's make sure that all memory holes are actually marked PageReserved(),
that page_to_pfn() produces reliable results, and that these pages are not
detected as "mmap" pages due to the mapcount.

E.g., booting a x86-64 QEMU guest with 4160 MB:

[ 0.010585] Early memory node ranges
[ 0.010586] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.010588] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
[ 0.010589] node 0: [mem 0x0000000100000000-0x0000000143ffffff]

max_pfn is 0x144000.

Before this change:

[root@localhost ~]# ./page-types -r -a 0x144000,
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000800 16384 64 ___________M_______________________________ mmap
total 16384 64

After this change:

[root@localhost ~]# ./page-types -r -a 0x144000,
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000100000000 16384 64 ___________________________r_______________ reserved
total 16384 64

IOW, especially the unavailable physical memory ("memory hole") in the
last section would not get properly marked PageReserved() and is indicated
to be "mmap" memory.

Drop the trace of that function from include/linux/mm.h - nobody else
needs it, and rename it accordingly.

Note: The fake zone/node might not be covered by the zone/node span. This
is not an urgent issue (for now, we had the same node/zone due to the
zeroing). We'll need a clean way to mark memory holes (e.g., using a page
type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
marking these memory holes PageReserved().

Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e822969c 03-Feb-2020 David Hildenbrand <david@redhat.com>

mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section

Patch series "mm: fix max_pfn not falling on section boundary", v2.

Playing with different memory sizes for a x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage. We
have to properly initialize these memmaps.

Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.

This patch (of 3):

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB). I was told that on real HW, we can easily have this scenario
(esp., one of the main reasons sub-section hotadd of devmem was added).

The issue is, that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.

E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
4160M" - (see tools/vm/page-types.c):

[ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[ 200.477500] #PF: supervisor read access in kernel mode
[ 200.478334] #PF: error_code(0x0000) - not-present page
[ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[ 200.479557] Oops: 0000 [#4] SMP NOPTI
[ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
[ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[ 200.488897] Call Trace:
[ 200.489115] kpageflags_read+0xe9/0x140
[ 200.489447] proc_reg_read+0x3c/0x60
[ 200.489755] vfs_read+0xc2/0x170
[ 200.490037] ksys_pread64+0x65/0xa0
[ 200.490352] do_syscall_64+0x5c/0xa0
[ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe

But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
after cold/hot plugging a DIMM to such a system:

[root@localhost ~]# cat /proc/kpageflags > /dev/null
[ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[ 111.517907] #PF: supervisor read access in kernel mode
[ 111.518333] #PF: error_code(0x0000) - not-present page
[ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

This patch fixes that by at least zero-ing out that memmap (so e.g.,
page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve. E.g., not all of
these pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone. A follow-up patch will take care of this.

Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org> [4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d680bdf 30-Jan-2020 Qian Cai <cai@lca.pw>

mm/page_isolation: fix potential warning from user

It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but should avoid triggering it from
userspace, i.e, from is_mem_section_removable() because it could crash
the system by a non-root user if warn_on_panic is set.

While at it, simplify the code a bit by removing an unnecessary jump
label.

Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a55c047 30-Jan-2020 Qian Cai <cai@lca.pw>

mm/hotplug: silence a lockdep splat with printk()

It is not that hard to trigger lockdep splats by calling printk from
under zone->lock. Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well). There are some console drivers which do
allocate from the printk context as well and those should be fixed. In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.

So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock. Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.

Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug. The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.

While at it, remove a similar but unnecessary debug-only printk() as
well. A sample of one of those lockdep splats is,

WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/8653 is trying to acquire lock:
ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750

but task is already holding lock:
ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&(&zone->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock+0x2f/0x40
rmqueue_bulk.constprop.21+0xb6/0x1160
get_page_from_freelist+0x898/0x22c0
__alloc_pages_nodemask+0x2f3/0x1cd0
alloc_pages_current+0x9c/0x110
allocate_slab+0x4c6/0x19c0
new_slab+0x46/0x70
___slab_alloc+0x58b/0x960
__slab_alloc+0x43/0x70
__kmalloc+0x3ad/0x4b0
__tty_buffer_request_room+0x100/0x250
tty_insert_flip_string_fixed_flag+0x67/0x110
pty_write+0xa2/0xf0
n_tty_write+0x36b/0x7b0
tty_write+0x284/0x4c0
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
redirected_tty_write+0x6a/0xc0
do_iter_write+0x248/0x2a0
vfs_writev+0x106/0x1e0
do_writev+0xd4/0x180
__x64_sys_writev+0x45/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #2 (&(&port->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
tty_port_tty_get+0x20/0x60
tty_port_default_wakeup+0xf/0x30
tty_port_tty_wakeup+0x39/0x40
uart_write_wakeup+0x2a/0x40
serial8250_tx_chars+0x22e/0x440
serial8250_handle_irq.part.8+0x14a/0x170
serial8250_default_handle_irq+0x5c/0x90
serial8250_interrupt+0xa6/0x130
__handle_irq_event_percpu+0x78/0x4f0
handle_irq_event_percpu+0x70/0x100
handle_irq_event+0x5a/0x8b
handle_edge_irq+0x117/0x370
do_IRQ+0x9e/0x1e0
ret_from_intr+0x0/0x2a
cpuidle_enter_state+0x156/0x8e0
cpuidle_enter+0x41/0x70
call_cpuidle+0x5e/0x90
do_idle+0x333/0x370
cpu_startup_entry+0x1d/0x1f
start_secondary+0x290/0x330
secondary_startup_64+0xb6/0xc0

-> #1 (&port_lock_key){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
serial8250_console_write+0x3e4/0x450
univ8250_console_write+0x4b/0x60
console_unlock+0x501/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5

-> #0 (console_owner){-.-.}:
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&(&zone->lock)->rlock);
lock(&(&port->lock)->rlock);
lock(&(&zone->lock)->rlock);
lock(console_owner);

*** DEADLOCK ***

9 locks held by test.sh/8653:
#0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
vfs_write+0x25f/0x290
#1: ffff888277618880 (&of->mutex){+.+.}, at:
kernfs_fop_write+0x128/0x240
#2: ffff8898131fc218 (kn->count#115){.+.+}, at:
kernfs_fop_write+0x138/0x240
#3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x16/0x50
#4: ffff8884374f4990 (&dev->mutex){....}, at:
device_offline+0x70/0x110
#5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xbf/0xa10
#6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x87/0x2f0
#7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
#8: ffffffff865a4920 (console_lock){+.+.}, at:
vprintk_emit+0x100/0x340

stack backtrace:
Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
BIOS U34 05/21/2019
Call Trace:
dump_stack+0x86/0xca
print_circular_bug.cold.31+0x243/0x26e
check_noncircular+0x29e/0x2e0
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe4c86c9 30-Jan-2020 David Hildenbrand <david@redhat.com>

mm: remove "count" parameter from has_unmovable_pages()

Now that the memory isolate notifier is gone, the parameter is always 0.
Drop it and cleanup has_unmovable_pages().

Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f135355 30-Jan-2020 Kirill A. Shutemov <kirill@shutemov.name>

mm/page_alloc: skip non present sections on zone initialization

memmap_init_zone() can be called on the ranges with holes during the
boot. It will skip any non-valid PFNs one-by-one. It works fine as
long as holes are not too big.

But huge holes in the memory map causes a problem. It takes over 20
seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much
larger holes in the memory map which would practically hang the system.

Deferred struct page init doesn't help here. It only works on the
present ranges.

Skipping non-present sections would fix the issue.

Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e57f8ac 13-Jan-2020 Vlastimil Babka <vbabka@suse.cz>

mm, debug_pagealloc: don't rely on static keys too early

Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.

However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.

To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.

[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc638f32 13-Jan-2020 Vlastimil Babka <vbabka@suse.cz>

mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which
should only compact existing free memory, followed by another attempt
that can allocate from any node using reclaim/compaction effort
specified by global defrag setting and madvise.

This patch makes the following changes to the scheme:

- Before the patch, the first allocation relies on a check for
pageblock order and __GFP_IO to prevent excessive reclaim. This
however affects also the second attempt, which is not limited to
single node.

Instead of that, reuse the existing check for costly order
__GFP_NORETRY allocations, and make sure the first THP attempt uses
__GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
allocations will bail out if compaction needs reclaim, while
previously they only bailed out when compaction was deferred due to
previous failures.

This should be still acceptable within the __GFP_NORETRY semantics.

- Before the patch, the second allocation attempt (on all nodes) was
passing __GFP_NORETRY. This is redundant as the check for pageblock
order (discussed above) was stronger. It's also contrary to
madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
requested.

After this patch, the second attempt doesn't pass __GFP_THISNODE nor
__GFP_NORETRY.

To sum up, THP page faults now try the following attempts:

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node

Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 867e5e1d 30-Nov-2019 Johannes Weiner <hannes@cmpxchg.org>

mm: clean up and clarify lruvec lookup procedure

There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled
in or disabled at runtime, we use pgdat->lruvec.

Document that in a comment.

Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find the
right lruvec manually right now. But to avoid future mistakes, rename the
pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order. The name
suggests a memcg operation, yet it takes a pgdat first and a memcg second.
I have to double take every time I call this. Fix that.

Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e47b346a 30-Nov-2019 lijiazi <jqqlijiazi@gmail.com>

mm/page_alloc.c: print reserved_highatomic info

Print nr_reserved_highatomic in show_free_areas, because when alloc_harder
is false, this value will be subtracted from the free_pages in
__zone_watermark_ok. Printing this value can help analyze memory
allocaction failure issues.

Link: http://lkml.kernel.org/r/19515f3de2fb6abe66b52e03e4b676a21e82beda.1573634806.git.lijiazi@xiaomi.com
Signed-off-by: lijiazi <lijiazi@xiaomi.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb1ef534 30-Nov-2019 Mel Gorman <mgorman@techsingularity.net>

mm, pcp: share common code between memory hotplug and percpu sysctl handler

Both the percpu_pagelist_fraction sysctl handler and memory hotplug have
a common requirement of updating the pcpu page allocation batch and high
values. Split the relevant helper to share common code.

No functional change.

Link: http://lkml.kernel.org/r/20191021094808.28824-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e27a2df 30-Nov-2019 Anshuman Khandual <anshuman.khandual@arm.com>

mm/page_alloc: add alloc_contig_pages()

HugeTLB helper alloc_gigantic_page() implements fairly generic
allocation method where it scans over various zones looking for a large
contiguous pfn range before trying to allocate it with
alloc_contig_range().

Other than deriving the requested order from 'struct hstate', there is
nothing HugeTLB specific in there. This can be made available for
general use to allocate contiguous memory which could not have been
allocated through the buddy allocator.

alloc_gigantic_page() has been split carving out actual allocation
method which is then made available via new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have
been replaced with more generic term 'contig'. Allocated pages here
should be freed with free_contig_range() or by calling __free_page() on
each allocated page.

Link: http://lkml.kernel.org/r/1571300646-32240-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 756d25be 30-Nov-2019 David Hildenbrand <david@redhat.com>

mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE

We have two types of users of page isolation:

1. Memory offlining: Offline memory so it can be unplugged. Memory
won't be touched.

2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
become the owner of the memory and make use of
it.

For example, in case we want to offline memory, we can ignore (skip
over) PageHWPoison() pages, as the memory won't get used. We can allow
to offline memory. In contrast, we don't want to allow to allocate such
memory.

Let's generalize the approach so we can special case other types of
pages we want to skip over in case we offline memory. While at it, also
pass the same flags to test_pages_isolated().

Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ee5f4f3 30-Nov-2019 David Hildenbrand <david@redhat.com>

mm/page_alloc.c: don't set pages PageReserved() when offlining

Patch series "mm: Memory offlining + page isolation cleanups", v2.

This patch (of 2):

We call __offline_isolated_pages() from __offline_pages() after all
pages were isolated and are either free (PageBuddy()) or PageHWPoison.
Nothing can stop us from offlining memory at this point.

In __offline_isolated_pages() we first set all affected memory sections
offline (offline_mem_sections(pfn, end_pfn)), to mark the memmap as
invalid (pfn_to_online_page() will no longer succeed), and then walk
over all pages to pull the free pages from the free lists (to the
isolated free lists, to be precise).

Note that re-onlining a memory block will result in the whole memmap
getting reinitialized, overwriting any old state. We already poision
the memmap when offlining is complete to find any access to
stale/uninitialized memmaps.

So, setting the pages PageReserved() is not helpful. The memap is
marked offline and all pageblocks are isolated. As soon as offline, the
memmap is stale either way.

This looks like a leftover from ancient times where we initialized the
memmap when adding memory and not when onlining it (the pages were set
PageReserved so re-onling would work as expected).

Link: http://lkml.kernel.org/r/20191021172353.3056-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1be334e5 05-Nov-2019 Johannes Weiner <hannes@cmpxchg.org>

mm/page_alloc.c: ratelimit allocation failure warnings more aggressively

While investigating a bug related to higher atomic allocation failures,
we noticed the failure warnings positively drowning the console, and in
our case trigger lockup warnings because of a serial console too slow to
handle all that output.

But even if we had a faster console, it's unclear what additional
information the current level of repetition provides.

Allocation failures happen for three reasons: The machine is OOM, the VM
is failing to handle reasonable requests, or somebody is making
unreasonable requests (and didn't acknowledge their opportunism with
__GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit
stats on skipped failure warnings should provide enough information to
let users/admins/developers know whether something is wrong and point
them in the right direction for debugging, bpftracing etc.

Limit allocation failure warnings to one spew every ten seconds.

Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e8fc007 05-Nov-2019 Mel Gorman <mgorman@techsingularity.net>

mm, meminit: recalculate pcpu batch and high limits after init completes

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list. As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with
the degree of severity depending on how many CPUs share a local zone and
the size of the zone. A private report indicated that kernel build
times were excessive with extremely high system CPU usage. A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.

kernbench
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Amean user-256 13249.50 ( 0.00%) 16401.31 * -23.79%*
Amean syst-256 14760.30 ( 0.00%) 4448.39 * 69.86%*
Amean elsp-256 162.42 ( 0.00%) 119.13 * 26.65%*
Stddev user-256 42.97 ( 0.00%) 19.15 ( 55.43%)
Stddev syst-256 336.87 ( 0.00%) 6.71 ( 98.01%)
Stddev elsp-256 2.46 ( 0.00%) 0.39 ( 84.03%)

5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Duration User 39766.24 49221.79
Duration System 44298.10 13361.67
Duration Elapsed 519.11 388.87

The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.

Before, this was the breakdown of batch and high values over all zones
was:

256 batch: 1
256 batch: 63
512 batch: 7
256 high: 0
256 high: 378
512 high: 42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
the patch:

256 batch: 1
768 batch: 63
256 high: 0
768 high: 378

[mgorman@techsingularity.net: fix merge/linkage snafu]
Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f36d866 14-Oct-2019 David Rientjes <rientjes@google.com>

mm, hugetlb: allow hugepage allocations to reclaim as needed

Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when
compaction may not succeed") has chnaged the allocator to bail out from
the allocator early to prevent from a potentially excessive memory
reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
reclaim and compaction loop as long as there is a reasonable chance to
make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.

The most obvious affected subsystem is hugetlbfs which allocates huge
pages based on an admin request (or via admin configured overcommit). I
have done a simple test which tries to allocate half of the memory for
hugetlb pages while the memory is full of a clean page cache. This is
not an unusual situation because we try to cache as much of the memory
as possible and sysctl/sysfs interface to allocate huge pages is there
for flexibility to allocate hugetlb pages at any time.

System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:

root@test1:~# cat hugetlb_test.sh

set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages

The results for 2 consecutive runs on clean 5.3

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256

Now with b39d0ee2632d applied

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12

The success rate went down by factor of 20!

Although hugetlb allocation requests might fail and it is reasonable to
expect them to under extremely fragmented memory or when the memory is
under a heavy pressure but the above situation is not that case.

Fix the regression by reverting back to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
those requests.

Mike said:

: hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
: boot where this may not be as much of an issue. However, I am aware of at
: least three use cases where allocations are made after the system has been
: up and running for quite some time:
:
: - DB reconfiguration. If sysctl/sysfs fails to get required number of
: huge pages, system is rebooted to perform allocation after boot.
:
: - VM provisioning. If unable get required number of huge pages, fall
: back to base pages.
:
: - An application that does not preallocate pool, but rather allocates
: pages at fault time for optimal NUMA locality.
:
: In all cases, I would expect b39d0ee2632d to cause regressions and
: noticable behavior changes.
:
: My quick/limited testing in
: https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
: was insufficient. It was also mentioned that if something like
: b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
: requests as in this patch.

[mhocko@suse.com: reworded changelog]
Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 234fdce8 06-Oct-2019 Qian Cai <cai@lca.pw>

mm/page_alloc.c: fix a crash in free_pages_prepare()

On architectures like s390, arch_free_page() could mark the page unused
(set_page_unused()) and any access later would trigger a kernel panic.
Fix it by moving arch_free_page() after all possible accessing calls.

Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
00000000275cca48 0000000000000100 0000000000000008 000003d080010000
00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
0000000026c2b962: e32014000036 pfd 2,1024(%r1)
#0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
>0000000026c2b96e: 41101100 la %r1,256(%r1)
0000000026c2b972: a737fff8 brctg %r3,26c2b962
0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
0000000026c2b97c: e31003400004 lg %r1,832
0000000026c2b982: ebff1430016a asi 5168(%r1),-1
Call Trace:
__free_pages_ok+0x16a/0x5d8)
memblock_free_all+0x206/0x290
mem_init+0x58/0x120
start_kernel+0x2b0/0x570
startup_continue+0x6a/0xc0
INFO: lockdep is turned off.
Last Breaking-Event-Address:
__free_pages_ok+0x372/0x5d8
Kernel panic - not syncing: Fatal exception: panic_on_oops
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C

In the past, only kernel_poison_pages() would trigger this but it needs
"page_poison=on" kernel cmdline, and I suspect nobody tested that on
s390. Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm:
security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
was added and could trigger this as well.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: <stable@vger.kernel.org> [5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b39d0ee2 04-Sep-2019 David Rientjes <rientjes@google.com>

mm, page_alloc: avoid expensive reclaim when compaction may not succeed

Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:

- isolate_freepages() is responsible for finding free pages to use as
migration targets and is implemented as a linear scan of memory
starting at the end of a zone,

- failing order-0 watermark checks in memory compaction does not account
for how far below the watermarks the zone actually is: to enable
migration, there must be *some* free memory available. Per the above,
watermarks are not always suffficient if isolate_freepages() cannot
find the free memory but it could require hundreds of MBs of reclaim to
even reach this threshold (read: potentially very expensive reclaim with
no indication compaction can be successful), and

- if compaction at this order has failed recently so that it does not even
run as a result of deferred compaction, looping through reclaim can often
be pointless.

For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.

Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory. If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ae88534 23-Sep-2019 Yang Shi <yang.shi@linux.alibaba.com>

mm: move mem_cgroup_uncharge out of __page_cache_release()

A later patch makes THP deferred split shrinker memcg aware, but it needs
page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.

So move mem_cgroup_uncharge() from __page_cache_release() to compound page
destructor, which is called by both THP and other compound pages except
HugeTLB. And call it in __put_single_page() for single order page.

Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 364c1eeb 23-Sep-2019 Yang Shi <yang.shi@linux.alibaba.com>

mm: thp: extract split_queue_* into a struct

Patch series "Make deferred split shrinker memcg aware", v6.

Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration. For example the below test would
run into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue. The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg. When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

Make deferred split shrinker not depend on memcg kmem since it is not
slab. It doesn't make sense to not shrink THP even though memcg kmem is
disabled.

With the above change the test demonstrated above doesn't trigger OOM even
though with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49433085 23-Sep-2019 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: raise compaction priority after it withdrawns

Mike Kravetz reports that "hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often then it
should. Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made."

The problem is that the compaction_withdrawn() test in
should_compact_retry() includes compaction outcomes that are only possible
on low compaction priority, and results in a retry without increasing the
priority. This may result in furter reclaim, and more incomplete
compaction attempts.

With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.

The COMPACT_SKIPPED result doesn't really fit together with the other
outcomes in compaction_withdrawn(), as that's a result caused by
insufficient order-0 pages, not due to low compaction priority. With this
patch, it is moved to a new compaction_needs_reclaim() function, and for
that outcome we keep the current logic of retrying if it looks like
reclaim will be able to help.

Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8c6546b 23-Sep-2019 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: introduce compound_nr()

Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a55c7454 08-Aug-2019 Matt Fleming <matt@codeblueprint.co.uk>

sched/topology: Improve load balancing on AMD EPYC systems

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cd961038 24-Aug-2019 David Rientjes <rientjes@google.com>

mm, page_alloc: move_freepages should not examine struct page of reserved memory

After commit 907ec5fca3dc ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed. This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.

The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory. That assumption is
no longer true after the aforementioned commit.

There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.

Note that pfn_valid() can be true for reserved memory: there is a
backing struct page. The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.

Move the debug checks to after verifying PageBuddy is true. This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on. In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).

Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback. We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820. pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone. The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.

The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.

We noticed it almost immediately after bringing 907ec5fca3dc in on
CONFIG_DEBUG_VM builds. It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype. So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.

Some struct pages were already zeroed from reserve pages before
907ec5fca3c so it theoretically could trigger before this commit. I
think it's rare enough under a config option that most people don't run
that others may not have noticed. I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.

Mel said:

: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fdc029b1 18-Aug-2019 Christoph Hellwig <hch@lst.de>

memremap: remove the dev field in struct dev_pagemap

The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have. Just remove the field and thus
the name in those two messages.

Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# ba72b4c8 18-Jul-2019 Dan Williams <dan.j.williams@intel.com>

mm/sparsemem: support sub-section hotplug

The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.

For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

# cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
100000000-1ffffffff : System RAM
200000000-303ffffff : Persistent Memory (legacy)
304000000-43fffffff : System RAM
440000000-23ffffffff : Persistent Memory
2400000000-43bfffffff : Persistent Memory
2400000000-43bfffffff : namespace2.0

WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
[..]
RIP: 0010:add_pages+0x5c/0x60
[..]
Call Trace:
devm_memremap_pages+0x460/0x6e0
pmem_attach_disk+0x29e/0x680 [nd_pmem]
? nd_dax_probe+0xfc/0x120 [libnvdimm]
nvdimm_bus_probe+0x66/0x160 [libnvdimm]

It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.

A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.

Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.

[1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

[osalvador@suse.de: fix deactivate_section for early sections]
Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46d945ae 18-Jul-2019 Dan Williams <dan.j.williams@intel.com>

mm: kill is_dev_zone() helper

Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f46edbd1 18-Jul-2019 Dan Williams <dan.j.williams@intel.com>

mm/sparsemem: add helpers track active portions of a section at boot

Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.

The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.

The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.

Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Jane Chu <jane.chu@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1eca35a 18-Jul-2019 Dan Williams <dan.j.williams@intel.com>

mm/sparsemem: introduce struct mem_section_usage

Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem. However, it does not use the
'bottom half' of memory hotplug, i.e. never marks pmem pages online and
never exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
valid_section() check

- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on
x86) individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also
check the active-sub-section mask. This indication is in the same
cacheline as the valid_section() so the performance impact is
expected to be negligible. So far the lkp robot has not reported any
regressions.

- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that
deal with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are not
touched since this capability is limited to !online
!memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3]. In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap. The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as global constant instead of per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users. Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5ca8071 16-Jul-2019 Yafang Shao <laoar.shao@gmail.com>

mm/vmscan.c: add a new member reclaim_state in struct shrink_control

Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths".

This patchset is to fix the issues in doing shrink slab.

There're six different reclaim paths by now,
- kswapd reclaim path
- node reclaim path
- hibernate preallocate memory reclaim path
- direct reclaim path
- memcg reclaim path
- memcg softlimit reclaim path

The slab caches reclaimed in these paths are only calculated in the
above three paths. The issues are detailed explained in patch #2. We
should calculate the reclaimed slab caches in every reclaim path. In
order to do it, the struct reclaim_state is placed into the struct
shrink_control.

In node reclaim path, there'is another issue about shrinking slab, which
is adressed in "mm/vmscan: shrink slab in node reclaim"
(https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/).

This patch (of 2):

The struct reclaim_state is used to record how many slab caches are
reclaimed in one reclaim path. The struct shrink_control is used to
control one reclaim path. So we'd better put reclaim_state into
shrink_control.

[laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()]
Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com
Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6471384a 11-Jul-2019 Alexander Potapenko <glider@google.com>

mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options

Patch series "add init_on_alloc/init_on_free boot options", v10.

Provide init_on_alloc and init_on_free boot options.

These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.

Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.

Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.

As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.

This patch (of 2):

The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.

This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)

init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.

init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.

Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.

Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.

If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.

Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.

The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.

Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.

[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e03a5125 11-Jul-2019 Nicholas Piggin <npiggin@gmail.com>

mm/large system hash: clear hashdist when only one node with memory is booted

CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
when booting on single node machines. This causes the large system hashes
to be allocated with vmalloc, and mapped with small pages.

This change clears hashdist if only one node has come up with memory.

This results in the important large inode and dentry hashes using memblock
allocations. All others are within 4MB size up to about 128GB of RAM,
which allows them to be allocated from the linear map on most non-NUMA
images.

Other big hashes like futex and TCP should eventually be moved over to the
same style of allocation as those vfs caches that use HASH_EARLY if
!hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.

This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
(performance is in the noise, under 1% difference, page tables are likely
to be well cached for this workload).

Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec11408a 11-Jul-2019 Nicholas Piggin <npiggin@gmail.com>

mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist

The kernel currently clamps large system hashes to MAX_ORDER when hashdist
is not set, which is rather arbitrary.

vmalloc space is limited on 32-bit machines, but this shouldn't result in
much more used because of small physical memory limiting system hash
sizes.

Include "vmalloc" or "linear" in the kernel log message.

Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3972f6bb 11-Jul-2019 Vlastimil Babka <vbabka@suse.cz>

mm, debug_pagealloc: use a page type instead of page_ext flag

When debug_pagealloc is enabled, we currently allocate the page_ext
array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that
we have the page_type field in struct page, we can use that instead, as
guard pages are neither PageSlab nor mapped to userspace. This reduces
memory overhead when debug_pagealloc is enabled and there are no other
features requiring the page_ext array.

Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4462b32c 11-Jul-2019 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: more extensive free page checking with debug_pagealloc

The page allocator checks struct pages for expected state (mapcount,
flags etc) as pages are being allocated (check_new_page()) and freed
(free_pages_check()) to provide some defense against errors in page
allocator users.

Prior commits 479f854a207c ("mm, page_alloc: defer debugging checks of
pages allocated from the PCP") and 4db7548ccbd9 ("mm, page_alloc: defer
debugging checks of freed pages until a PCP drain") this has happened
for order-0 pages as they were allocated from or freed to the per-cpu
caches (pcplists). Since those are fast paths, the checks are now
performed only when pages are moved between pcplists and global free
lists. This however lowers the chances of catching errors soon enough.

In order to increase the chances of the checks to catch errors, the
kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
multiple other internal debug checks (VM_BUG_ON() etc), which is
suboptimal when the goal is to catch errors in mm users, not in mm code
itself.

To catch some wrong users of the page allocator we have
CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
unless enabled at boot time. Memory corruptions when writing to freed
pages have often the same underlying errors (use-after-free, double free)
as corrupting the corresponding struct pages, so this existing debugging
functionality is a good fit to extend by also perform struct page checks
at least as often as if CONFIG_DEBUG_VM was enabled.

Specifically, after this patch, when debug_pagealloc is enabled on boot,
and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
freed to the pcplists *in addition* to being moved between pcplists and
free lists. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
pages are checked when being moved between pcplists and free lists *in
addition* to when allocated from or freed to the pcplists.

When debug_pagealloc is not enabled on boot, the overhead in fast paths
should be virtually none thanks to the use of static key.

Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96a2b03f 11-Jul-2019 Vlastimil Babka <vbabka@suse.cz>

mm, debug_pagelloc: use static keys to enable debugging

Patch series "debug_pagealloc improvements".

I have been recently debugging some pcplist corruptions, where it would be
useful to perform struct page checks immediately as pages are allocated
from and freed to pcplists, which is now only possible by rebuilding the
kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).

To make this kind of debugging simpler in future on a distro kernel, I
have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
and extended it to perform the struct page checks more often when enabled
(Patch 2). Now it can be configured in when building a distro kernel
without extra overhead, and debugging page use after free or double free
can be enabled simply by rebooting with debug_pagealloc=on.

This patch (of 3):

CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f15
("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
allow being always enabled in a distro kernel, but only perform its
expensive functionality when booted with debug_pagelloc=on. We can
further reduce the overhead when not boot-enabled (including page
allocator fast paths) using static keys. This patch introduces one for
debug_pagealloc core functionality, and another for the optional guard
page functionality (enabled by booting with debug_guardpage_minorder=X).

Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98ef2046 11-Jul-2019 Denis Efremov <efremov@linux.com>

mm: remove the exporting of totalram_pages

Previously totalram_pages was the global variable. Currently,
totalram_pages is the static inline function from the include/linux/mm.h
However, the function is also marked as EXPORT_SYMBOL, which is at best an
odd combination. Because there is no point for the static inline function
from a public header to be exported, this commit removes the
EXPORT_SYMBOL() marking. It will be still possible to use the function in
modules because all the symbols it depends on are exported.

Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
Fixes: ca79b0c211af6 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
Signed-off-by: Denis Efremov <efremov@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9705d87 04-Jul-2019 Juergen Gross <jgross@suse.com>

mm/page_alloc.c: fix regression with deferred struct page init

Commit 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time
instead of doing larger sections") is causing a regression on some
systems when the kernel is booted as Xen dom0.

The system will just hang in early boot.

Reason is an endless loop in get_page_from_freelist() in case the first
zone looked at has no free memory. deferred_grow_zone() is always
returning true due to the following code snipplet:

/* If the zone is empty somebody else may have cleared out the zone */
if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
first_deferred_pfn)) {
pgdat->first_deferred_pfn = ULONG_MAX;
pgdat_resize_unlock(pgdat, &flags);
return true;
}

This in turn results in the loop as get_page_from_freelist() is assuming
forward progress can be made by doing some more struct page
initialization.

Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
Fixes: 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
Signed-off-by: Juergen Gross <jgross@suse.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a164fef 26-Jun-2019 Christoph Hellwig <hch@lst.de>

mm: simplify ZONE_DEVICE page private data

Remove the clumsy hmm_devmem_page_{get,set}_drvdata helpers, and
instead just access the page directly. Also make the page data
a void pointer, and thus much easier to use.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 514caf23 26-Jun-2019 Christoph Hellwig <hch@lst.de>

memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag

Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible. Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 97500a4a 14-May-2019 Dan Williams <dan.j.williams@intel.com>

mm: maintain randomization of page free lists

When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time. Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b03641af 14-May-2019 Dan Williams <dan.j.williams@intel.com>

mm: move buddy list manipulations into helpers

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e900a918 14-May-2019 Dan Williams <dan.j.williams@intel.com>

mm: shuffle initial free memory to improve memory-side-cache utilization

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software

and this abandoned patch series proposed that for the kernel:
https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

Dan's patch series doesn't attempt to ensure buffers won't conflict, but
also reduces the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel). That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
measurement to the Node Health Check that follows each job;
nodes are rebooted whenever their measured memory bandwidth
falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability"). With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
3X speedup in a contrived case that tries to force cache conflicts.
The contrived cased used the numa_emulation capability to force an
instance of the benchmark to be run in two of the near-memory sized
numa nodes. If both instances were placed on the same emulated they
would fit and cause zero conflicts. While on separate emulated nodes
without randomization they underutilized the cache and conflicted
unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8%
for the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages. Pages are freed in physical address order when first
onlined.

Quoting Kees:
"While we already have a base-address randomization
(CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
memory layouts would certainly be using the predictability of
allocation ordering (i.e. for attacks where the base address isn't
important: only the relative positions between allocated memory).
This is common in lots of heap-style attacks. They try to gain
control over ordering by spraying allocations, etc.

I'd really like to see this because it gives us something similar
to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time. Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions. It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 136ac591 14-May-2019 Baruch Siach <baruch@tkos.co.il>

mm: update references to page _refcount

Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
_refcount") left out a couple of references to the old field name. Fix
that.

Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 350e88ba 13-May-2019 Mike Rapoport <rppt@kernel.org>

mm: memblock: make keeping memblock memory opt-in rather than opt-out

Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.

Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c52e6d0 13-May-2019 Yafang Shao <laoar.shao@gmail.com>

mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist

Because rmqueue_pcplist() is only called when order is 0, we don't need to
use order as a parameter.

Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5557c766 13-May-2019 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: cleanup memory offline path

check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated suceeds on the range. This is based on
the assumption that all pages in the range are freed which is currently
the case in most cases but it won't be with later changes, as pages marked
as vmemmap won't be isolated.

Move the offlined pages counting to offline_isolated_pages_cb and rely on
__offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do it's primary job and check the pfn
range.

While we are at it remove check_pages_isolated and offline_isolated_pages
and use directly walk_system_ram_range as do in online_pages.

Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e56acae 13-May-2019 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections

Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
use it to support initializing and freeing pages in groups no larger than
MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache
locality of the pages while we do several loops over them in the init and
freeing process.

We are able to tighten the loops further as a result of the "from"
iterator as we can perform the initial checks for first_init_pfn in our
first call to the iterator, and continue without the need for those checks
via the "from" iterator. I have added this functionality in the function
called deferred_init_mem_pfn_range_in_zone that primes the iterator and
causes us to exit if we encounter any failure.

On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 1.85s to 1.38s as a result of this patch.

Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <yi.z.zhang@linux.intel.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 837566e7 13-May-2019 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: implement new zone specific memblock iterator

Introduce a new iterator for_each_free_mem_pfn_range_in_zone.

This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes are of all the bounds
checking we were doing in deferred_grow_zone, and deferred_init_memmap.
In addition it should help to speed up the search a bit by iterating until
the end of a range is greater than the start of the zone pfn range, and
will exit completely if the start is beyond the end of the zone.

Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <yi.z.zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56ec43d8 13-May-2019 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: drop meminit_pfn_in_nid as it is redundant

As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range
which will only return a memory range if it matches the node ID provided
assuming it is not NUMA_NO_NODE.

I am operating on the assumption that there are no zones or pgdata_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID and as such the check is redundant.

So one piece I would like to verify on this is if this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.

On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 2.80s to 1.85s as a result of this patch.

Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <yi.z.zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 299c83dc 13-May-2019 Linxu Fang <fanglinxu@huawei.com>

mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE

342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
later patches rewrote the calculation of node spanned pages.

e506b99696a2 ("mem-hotplug: fix node spanned pages when we have a movable
node"), but the current code still has problems,

When we have a node with only zone_movable and the node id is not zero,
the size of node spanned pages is double added.

That's because we have an empty normal zone, and zone_start_pfn or
zone_end_pfn is not between arch_zone_lowest_possible_pfn and
arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
range just like the commit <96e907d13602> (bootmem: Reimplement
__absent_pages_in_range() using for_each_mem_pfn_range()).

e.g.
Zone ranges:
DMA [mem 0x0000000000001000-0x0000000000ffffff]
DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Normal [mem 0x0000000100000000-0x000000023fffffff]
Movable zone start for each node
Node 0: 0x0000000100000000
Node 1: 0x0000000140000000
Early memory node ranges
node 0: [mem 0x0000000000001000-0x000000000009efff]
node 0: [mem 0x0000000000100000-0x00000000bffdffff]
node 0: [mem 0x0000000100000000-0x000000013fffffff]
node 1: [mem 0x0000000140000000-0x000000023fffffff]

node 0 DMA spanned:0xfff present:0xf9e absent:0x61
node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020
node 0 Normal spanned:0 present:0 absent:0
node 0 Movable spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA spanned:0 present:0 absent:0
node 1 DMA32 spanned:0 present:0 absent:0
node 1 Normal spanned:0x100000 present:0x100000 absent:0
node 1 Movable spanned:0x100000 present:0x100000 absent:0
On node 1 totalpages(node_present_pages): 2097152
node_spanned_pages:2097152
Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
cma-reserved)

It shows that the current memory of node 1 is double added.
After this patch, the problem is fixed.

node 0 DMA spanned:0xfff present:0xf9e absent:0x61
node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020
node 0 Normal spanned:0 present:0 absent:0
node 0 Movable spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA spanned:0 present:0 absent:0
node 1 DMA32 spanned:0 present:0 absent:0
node 1 Normal spanned:0 present:0 absent:0
node 1 Movable spanned:0x100000 present:0x100000 absent:0
On node 1 totalpages(node_present_pages): 1048576
node_spanned_pages:1048576
memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
cma-reserved)

Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4eb0716e 13-May-2019 Alexandre Ghiti <alex@ghiti.fr>

hugetlb: allow to free gigantic pages regardless of the configuration

On systems without CONTIG_ALLOC activated but that support gigantic pages,
boottime reserved gigantic pages can not be freed at all. This patch
simply enables the possibility to hand back those pages to memory
allocator.

Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: David S. Miller <davem@davemloft.net> [sparc]
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8df995f6 13-May-2019 Alexandre Ghiti <alex@ghiti.fr>

mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOC

This condition allows to define alloc_contig_range, so simplify it into a
more accurate naming.

Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 63931eb9 13-May-2019 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact()

alloc_pages_exact*() allocates a page of sufficient order and then splits
it to return only the number of pages requested. That makes it
incompatible with __GFP_COMP, because compound pages cannot be split.

As shown by [1] things may silently work until the requested size
(possibly depending on user) stops being power of two. Then for
CONFIG_DEBUG_VM, BUG_ON() triggers in split_page(). Without
CONFIG_DEBUG_VM, consequences are unclear.

There are several options here, none of them great:

1) Don't do the splitting when __GFP_COMP is passed, and return the
whole compound page. However if caller then returns it via
free_pages_exact(), that will be unexpected and the freeing actions
there will be wrong.

2) Warn and remove __GFP_COMP from the flags. But the caller may have
really wanted it, so things may break later somewhere.

3) Warn and return NULL. However NULL may be unexpected, especially
for small sizes.

This patch picks option 2, because as Michal Hocko put it: "callers wanted
it" is much less probable than "caller is simply confused and more gfp
flags is surely better than fewer".

[1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u

Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d6332692 25-Apr-2019 Rick Edgecombe <rick.p.edgecombe@intel.com>

mm/hibernation: Make hibernation handle unmapped pages

Make hibernate handle unmapped pages on the direct map when
CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages
to invalid configurations, so now hibernate should check if the pages have
valid mappings and handle if they are unmapped when doing a hibernate
save operation.

Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y
was configured. It does not appear to have a big hibernating performance
impact. The speed of the saving operation before this change was measured
as 819.02 MB/s, and after was measured at 813.32 MB/s.

Before:
[ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)

After:
[ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8118b82e 25-Apr-2019 Andrey Ryabinin <ryabinin.a.a@gmail.com>

mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag

Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
removed setting of the ALLOC_NOFRAGMENT flag. Bring it back.

The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so
that allocations are spread across local zones to avoid fragmentation
due to mixing pageblocks as long as possible.

Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8139ad04 25-Apr-2019 Andrey Ryabinin <ryabinin.a.a@gmail.com>

mm/page_alloc.c: avoid potential NULL pointer dereference

ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL.
'zone' pointer unconditionally derefernced in alloc_flags_nofragment().
Bail out on NULL zone to avoid potential crash. Currently we don't see
any crashes only because alloc_flags_nofragment() has another bug which
allows compiler to optimize away all accesses to 'zone'.

Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee8ab0ee 25-Apr-2019 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: always use a captured page regardless of compaction result

During the development of commit 5e1f0f098b46 ("mm, compaction: capture
a page under direct compaction"), a paranoid check was added to ensure
that if a captured page was available after compaction that it was
consistent with the final state of compaction. The intent was to catch
serious programming bugs such as using a stale page pointer and causing
corruption problems.

However, it is possible to get a captured page even if compaction was
unsuccessful if an interrupt triggered and happened to free pages in
interrupt context that got merged into a suitable high-order page. It's
highly unlikely but Li Wang did report the following warning on s390
occuring when testing OOM handling. Note that the warning is slightly
edited for clarity.

WARNING: CPU: 0 PID: 9783 at mm/page_alloc.c:3777 __alloc_pages_direct_compact+0x182/0x190
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs
lockd grace fscache sunrpc pkey ghash_s390 prng xts aes_s390
des_s390 des_generic sha512_s390 zcrypt_cex4 zcrypt vmur binfmt_misc
ip_tables xfs libcrc32c dasd_fba_mod qeth_l2 dasd_eckd_mod dasd_mod
qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log
dm_mod
CPU: 0 PID: 9783 Comm: copy.sh Kdump: loaded Not tainted 5.1.0-rc 5 #1

This patch simply removes the check entirely instead of trying to be
clever about pages freed from interrupt context. If a serious
programming error was introduced, it is highly likely to be caught by
prep_new_page() instead.

Link: http://lkml.kernel.org/r/20190419085133.GH18914@techsingularity.net
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Li Wang <liwang@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 24512228b 25-Apr-2019 Mel Gorman <mgorman@techsingularity.net>

mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model

Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
amounts of memory when an external fragmentation event occurs") "broke"
memory management on parisc.

The machine is not NUMA but the DISCONTIG model creates three pgdats
even though it's a UMA machine for the following ranges

0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB
2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB

Mikulas reported:

With the patch 1c30844d2, the kernel will incorrectly reclaim the
first zone when it fills up, ignoring the fact that there are two
completely free zones. Basiscally, it limits cache size to 1GiB.

For example, if I run:
# dd if=/dev/sda of=/dev/null bs=1M count=2048

- with the proper kernel, there should be "Buffers - 2GiB"
when this command finishes. With the patch 1c30844d2, buffers
will consume just 1GiB or slightly more, because the kernel was
incorrectly reclaiming them.

The page allocator and reclaim makes assumptions that pgdats really
represent NUMA nodes and zones represent ranges and makes decisions on
that basis. Watermark boosting for small pgdats leads to unexpected
results even though this would have behaved reasonably on SPARSEMEM.

DISCONTIG is essentially deprecated and even parisc plans to move to
SPARSEMEM so there is no need to be fancy, this patch simply disables
watermark boosting by default on DISCONTIGMEM.

Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a9f2191 18-Apr-2019 Qian Cai <cai@lca.pw>

mm/hotplug: treat CMA pages as unmovable

has_unmovable_pages() is used by allocating CMA and gigantic pages as
well as the memory hotplug. The later doesn't know how to offline CMA
pool properly now, but if an unused (free) CMA page is encountered, then
has_unmovable_pages() happily considers it as a free memory and
propagates this up the call chain. Memory offlining code then frees the
page without a proper CMA tear down which leads to an accounting issues.
Moreover if the same memory range is onlined again then the memory never
gets back to the CMA pool.

State after memory offline:

# grep cma /proc/vmstat
nr_free_cma 205824

# cat /sys/kernel/debug/cma/cma-kvm_cma/count
209920

Also, kmemleak still think those memory address are reserved below but
have already been used by the buddy allocator after onlining. This
patch fixes the situation by treating CMA pageblocks as unmovable except
when has_unmovable_pages() is called as part of CMA allocation.

Offlined Pages 4096
kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
create_object+0x344/0x380
__kmalloc_node+0x3ec/0x860
kvmalloc_node+0x58/0x110
seq_read+0x41c/0x620
__vfs_read+0x3c/0x70
vfs_read+0xbc/0x1a0
ksys_read+0x7c/0x140
system_call+0x5c/0x70
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xc000201cc8000000 (size 13757317120):
kmemleak: comm "swapper/0", pid 0, jiffies 4294937297
kmemleak: min_count = -1
kmemleak: count = 0
kmemleak: flags = 0x5
kmemleak: checksum = 0
kmemleak: backtrace:
cma_declare_contiguous+0x2a4/0x3b0
kvm_cma_reserve+0x11c/0x134
setup_arch+0x300/0x3f8
start_kernel+0x9c/0x6e8
start_here_common+0x1c/0x4b0
kmemleak: Automatic memory scanning thread ended

[cai@lca.pw: use is_migrate_cma_page() and update commit log]
Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9b7ea46a 28-Mar-2019 Qian Cai <cai@lca.pw>

mm/hotplug: fix offline undo_isolate_page_range()

Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
memory to zones until online") introduced move_pfn_range_to_zone() which
calls memmap_init_zone() during onlining a memory block.
memmap_init_zone() will reset pagetype flags and makes migrate type to
be MOVABLE.

However, in __offline_pages(), it also call undo_isolate_page_range()
after offline_isolated_pages() to do the same thing. Due to commit
2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
__first_valid_page() to skip offline pages, undo_isolate_page_range()
here just waste CPU cycles looping around the offlining PFN range while
doing nothing, because __first_valid_page() will return NULL as
offline_isolated_pages() has already marked all memory sections within
the pfn range as offline via offline_mem_sections().

Also, after calling the "useless" undo_isolate_page_range() here, it
reaches the point of no returning by notifying MEM_OFFLINE. Those pages
will be marked as MIGRATE_MOVABLE again once onlining. The only thing
left to do is to decrease the number of isolated pageblocks zone counter
which would make some paths of the page allocation slower that the above
commit introduced.

Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
on ppc64, an "int" should still be enough to represent the number of
pageblocks there. Fix an incorrect comment along the way.

[cai@lca.pw: v4]
Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 26fb3dae 12-Mar-2019 Mike Rapoport <rppt@kernel.org>

memblock: drop memblock_alloc_*_nopanic() variants

As all the memblock allocation functions return NULL in case of error
rather than panic(), the duplicates with _nopanic suffix can be removed.

Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com> [printk]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com> [c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com> [Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# afa00112 05-Mar-2019 Christoph Hellwig <hch@lst.de>

mm: unexport free_reserved_area

This function is only used by built-in code, which makes perfect sense
given the purpose of it.

Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a862f68a 05-Mar-2019 Mike Rapoport <rppt@kernel.org>

docs/core-api/mm: fix return value descriptions in mm/

Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:

$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...

Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.

Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce0725f7 05-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

numa: make "nr_online_nodes" unsigned int

Number of online NUMA nodes can't be negative as well. This doesn't
save space as the variable is used only in 32-bit context, but do it
anyway for consistency.

Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9726c26 05-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

numa: make "nr_node_ids" unsigned int

Number of NUMA nodes can't be negative.

This saves a few bytes on x86_64:

add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
Function old new delta
hv_synic_alloc.cold 88 110 +22
prealloc_shrinker 260 262 +2
bootstrap 249 251 +2
sched_init_numa 1566 1567 +1
show_slab_objects 778 777 -1
s_show 1201 1200 -1
kmem_cache_init 346 345 -1
__alloc_workqueue_key 1146 1145 -1
mem_cgroup_css_alloc 1614 1612 -2
__do_sys_swapon 4702 4699 -3
__list_lru_init 655 651 -4
nic_probe 2379 2374 -5
store_user_store 118 111 -7
red_zone_store 106 99 -7
poison_store 106 99 -7
wq_numa_init 348 338 -10
__kmem_cache_empty 75 65 -10
task_numa_free 186 173 -13
merge_across_nodes_store 351 336 -15
irq_create_affinity_masks 1261 1246 -15
do_numa_crng_init 343 321 -22
task_numa_fault 4760 4737 -23
swapfile_init 179 156 -23
hv_synic_alloc 536 492 -44
apply_wqattrs_prepare 746 695 -51

Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23a7052a 05-Mar-2019 Mike Rapoport <rppt@kernel.org>

mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()

There are two early memory allocations that use
memblock_alloc_node_nopanic() and do not check its return value.

While this happens very early during boot and chances that the
allocation will fail are diminishing, it is still worth to have proper
checks for the allocation errors.

Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bb4e7a2 05-Mar-2019 Wei Yang <richard.weiyang@gmail.com>

mm: fix some typos in mm directory

No functional change.

Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9f7979c 05-Mar-2019 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e1f0f09 05-Mar-2019 Mel Gorman <mgorman@techsingularity.net>

mm, compaction: capture a page under direct compaction

Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator. The
intent is to avoid redundant scanning.

5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

Latency is only moderately affected but the devil is in the details. A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes creater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder

5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned 20815362 19573286
Compaction free scanned 16352612 11510663

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd1444b2 05-Mar-2019 Mel Gorman <mgorman@techsingularity.net>

mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction

When pageblocks get fragmented, watermarks are artifically boosted to
reclaim pages to avoid further fragmentation events. However,
compaction is often either fragmentation-neutral or moving movable pages
away from unmovable/reclaimable pages. As the true watermarks are
preserved, allow compaction to ignore the boost factor.

The expected impact is very slight as the main benefit is that
compaction is slightly more likely to succeed when the system has been
fragmented very recently. On both 1-socket and 2-socket machines for
THP-intensive allocation during fragmentation the success rate was
increased by less than 1% which is marginal. However, detailed tracing
indicated that failure of migration due to a premature ENOMEM triggered
by watermark checks were eliminated.

Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c52e7593 05-Mar-2019 Wei Yang <richard.weiyang@gmail.com>

mm: remove extra drain pages on pcp list

In the current implementation, there are two places to isolate a range
of page: __offline_pages() and alloc_contig_range(). During this
procedure, it will drain pages on pcp list.

Below is a brief call flow:

__offline_pages()/alloc_contig_range()
start_isolate_page_range()
set_migratetype_isolate()
drain_all_pages()
drain_all_pages() <--- A

This snippet shows the current logic is isolate and drain pcp list for
each pageblock and drain pcp list again for the whole range.

start_isolate_page_range is responsible for isolating the given pfn
range. One part of that job is to make sure that also pages that are on
the allocator pcp lists are properly isolated. Otherwise they could be
reused and the range wouldn't be completely isolated until the memory is
freed back. While there is no strict guarantee here because pages might
get allocated at any time before drain_all_pages is called there doesn't
seem to be any strong demand for such a guarantee.

In any case, draining is already done at the isolation level and there
is no need to do it again later by start_isolate_page_range callers
(memory hotplug and CMA allocator currently). Therefore remove
pointless draining in existing callers to make the code more clear and
functionally correct.

[mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60cd4bcd 05-Mar-2019 Shakeel Butt <shakeelb@google.com>

memcg: localize memcg_kmem_enabled() check

Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
functions, so, the users don't have to explicitly check that condition.

This is purely code cleanup patch without any functional change. Only
the order of checks in memcg_charge_slab() can potentially be changed
but the functionally it will be same. This should not matter as
memcg_charge_slab() is not in the hot path.

Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98fa15f3 05-Mar-2019 Anshuman Khandual <anshuman.khandual@arm.com>

mm: replace all open encodings for NUMA_NO_NODE

Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9cd410a 05-Mar-2019 Arun KS <arunks@codeaurora.org>

mm/page_alloc.c: memory hotplug: free pages as higher order

When freeing pages are done with higher order, time spent on coalescing
pages by buddy allocator can be reduced. With section size of 256MB,
hot add latency of a single section shows improvement from 50-60 ms to
less than 1 ms, hence improving the hot add latency by 60 times. Modify
external providers of online callback to align with the change.

[arunks@codeaurora.org: v11]
Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4117992d 05-Mar-2019 Qian Cai <cai@lca.pw>

page_poison: play nicely with KASAN

KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:

BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
Read of size 8 at addr ffff88881f800000 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
__asan_report_load8_noabort+0x19/0x20
memchr_inv+0x2ea/0x330
kernel_poison_pages+0x103/0x3d5
get_page_from_freelist+0x15e7/0x4d90

because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.

Also, false positives in free path,

BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
check_memory_region+0x22d/0x250
memset+0x28/0x40
kernel_poison_pages+0x29e/0x3d5
__free_pages_ok+0x75f/0x13e0

due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.

Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 94b3334c 20-Feb-2019 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: fix a division by zero error when boosting watermarks v2

Yury Norov reported that an arm64 KVM instance could not boot since
after v5.0-rc1 and could addressed by reverting the patches

1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external
73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")

The problem is that a division by zero error is possible if boosting
occurs very early in boot if the system has very little memory. This
patch avoids the division by zero error.

Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86447726 15-Feb-2019 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: Use fixed constant in page_frag_alloc instead of size + 1

This patch replaces the size + 1 value introduced with the recent fix for 1
byte allocs with a constant value.

The idea here is to reduce code overhead as the previous logic would have
to read size into a register, then increment it, and write it back to
whatever field was being used. By using a constant we can avoid those
memory reads and arithmetic operations in favor of just encoding the
maximum value into the operation itself.

Fixes: 2c2ade81741c ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c2ade81 13-Feb-2019 Jann Horn <jannh@google.com>

mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4aa9fc2a 25-Jan-2019 Michal Hocko <mhocko@suse.com>

Revert "mm, memory_hotplug: initialize struct pages for the full memory section"

This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:

Early memory node ranges
node 1: [mem 0x0000000000001000-0x0000000000090fff]
node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
node 1: [mem 0x0000000100000000-0x0000001423ffffff]
node 0: [mem 0x0000001424000000-0x0000002023ffffff]

This means that node0 starts in the middle of a memory section which is
also in node1. memmap_init_zone tries to initialize padding of a
section even when it is outside of the given pfn range because there are
code paths (e.g. memory hotplug) which assume that the full worth of
memory section is always initialized.

In this particular case, though, such a range is already intialized and
most likely already managed by the page allocator. Scribbling over
those pages corrupts the internal state and likely blows up when any of
those pages gets used.

Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73444bc4 08-Jan-2019 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: do not wake kswapd with zone lock held

syzbot reported the following regression in the latest merge window and
it was confirmed by Qian Cai that a similar bug was visible from a
different context.

======================================================
WARNING: possible circular locking dependency detected
4.20.0+ #297 Not tainted
------------------------------------------------------
syz-executor0/8529 is trying to acquire lock:
000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
__wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

but task is already holding lock:
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
include/linux/spinlock.h:329 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
mm/page_alloc.c:2548 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
mm/page_alloc.c:3021 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
mm/page_alloc.c:3050 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
mm/page_alloc.c:3072 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

It appears to be a false positive in that the only way the lock ordering
should be inverted is if kswapd is waking itself and the wakeup
allocates debugging objects which should already be allocated if it's
kswapd doing the waking. Nevertheless, the possibility exists and so
it's best to avoid the problem.

This patch flags a zone as needing a kswapd using the, surprisingly,
unused zone flag field. The flag is read without the lock held to do
the wakeup. It's possible that the flag setting context is not the same
as the flag clearing context or for small races to occur. However, each
race possibility is harmless and there is no visible degredation in
fragmentation treatment.

While zone->flag could have continued to be unused, there is potential
for moving some existing fields into the flags field instead.
Particularly read-mostly ones like zone->initialized and
zone->contiguous.

Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af3b8544 28-Dec-2018 Benjamin Poirier <bpoirier@suse.com>

mm/page_alloc.c: allow error injection

Model call chain after should_failslab(). Likewise, we can now use a
kprobe to override the return value of should_fail_alloc_page() and inject
allocation failures into alloc_page*().

This will allow injecting allocation failures using the BCC tools even
without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
fail_page_alloc= parameter, which incurs some overhead even when failures
are not being injected. On the other hand, this patch adds an
unconditional call to should_fail_alloc_page() from page allocation
hotpath. That overhead should be rather negligible with
CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

[vbabka@suse.cz: changelog addition]
Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9367bd0 28-Dec-2018 Wei Yang <richard.weiyang@gmail.com>

mm, page_alloc: enable pcpu_drain with zone capability

drain_all_pages is documented to drain per-cpu pages for a given zone (if
non-NULL). The current implementation doesn't match the description
though. It will drain all pcp pages for all zones that happen to have
cached pages on the same cpu as the given zone. This will lead to
premature pcp cache draining for zones that are not of any interest to the
caller - e.g. compaction, hwpoison or memory offline.

This forces the page allocator to take locks and potential lock contention
as a result.

There is no real reason for this sub-optimal implementation. Replace
per-cpu work item with a dedicated structure which contains a pointer to
the zone and pass it over to the worker. This will get the zone
information all the way down to the worker function and do the right job.

[akpm@linux-foundation.org: avoid 80-col tricks]
[mhocko@suse.com: refactor the whole changelog]
Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c0c12cc 28-Dec-2018 Waiman Long <longman@redhat.com>

mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init

When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred
pages initialization can take a long time. Below were the reported init
times on a 8-socket 96-core 4TB IvyBridge system.

1) Non-debug kernel without CONFIG_KASAN
[ 8.764222] node 1 initialised, 132086516 pages in 7027ms

2) Debug kernel with CONFIG_KASAN
[ 146.288115] node 1 initialised, 132075466 pages in 143052ms

So the page init time in a debug kernel was 20X of the non-debug kernel.
The long init time can be problematic as the page initialization is done
with interrupt disabled. In this particular case, it caused the
appearance of following warning messages as well as NMI backtraces of all
the cores that were doing the initialization.

[ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
[ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
[ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
[ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
[ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
[ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
[ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
[ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

The long init time was mainly caused by the call to kasan_free_pages() to
poison the newly initialized pages. On a 4TB system, we are talking about
almost 500GB of memory probably on the same node.

In reality, we may not need to poison the newly initialized pages before
they are ever allocated. So KASAN poisoning of freed pages before the
completion of deferred memory initialization is now disabled. Those pages
will be properly poisoned when they are allocated or freed after deferred
pages initialization is done.

With this change, the new page initialization time became:

[ 21.948010] node 1 initialised, 132075466 pages in 18702ms

This was still about double the non-debug kernel time, but was much
better than before.

Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 125b860b 28-Dec-2018 Pingfan Liu <kernelfans@gmail.com>

mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES

Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code.
If someone adds extra migrate type, then he may forget to enlarge the
NR_PAGEBLOCK_BITS. Hence it requires some way to fix.

NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on
two different .h file with reverse dependency, it is a little hard to
refer to MIGRATE_TYPES in pageblock-flag.h. This patch tries to remind
such relation in compiling-time.

Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bbe5d993 28-Dec-2018 Oscar Salvador <osalvador@suse.de>

mm/page_alloc.c: drop uneeded __meminit and __meminitdata

Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
free_area_init_core_hotplug"), some functions changed to only be called
during system initialization. Concretly, free_area_init_node() and the
functions that hang from it.

Also, some variables are no longer used after the system has gone
through initialization. So this could be considered as a late clean-up
for that patch.

This patch changes the functions from __meminit to __init, and the
variables from __meminitdata to __initdata.

In return, we get some KBs back:

Before:
Freeing unused kernel image memory: 2472K

After:
Freeing unused kernel image memory: 2480K

Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23b68cfa 28-Dec-2018 Wei Yang <richard.weiyang@gmail.com>

mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init()

When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
each node's highest zone is initialized before defer stage.

static_init_pgcnt is used to store the number of pages like this:

pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
pgdat->node_spanned_pages);

because we don't want to overflow zone's range.

But this is not necessary, since defer_init() is called like this:

memmap_init_zone()
for pfn in [start_pfn, end_pfn)
defer_init(pfn, end_pfn)

In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
stop before calling defer_init().

BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
since nr_initialised is zone based instead of node based. Even
node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
would have pages less than PAGES_PER_SECTION.

Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef8444ea 28-Dec-2018 yuzhoujian <yuzhoujian@didichuxing.com>

mm, oom: reorganize the oom report in dump_header

OOM report contains several sections. The first one is the allocation
context that has triggered the OOM. Then we have cpuset context followed
by the stack trace of the OOM path. The tird one is the OOM memory
information. Followed by the current memory state of all system tasks.
At last, we will show oom eligible tasks and the information about the
chosen oom victim.

One thing that makes parsing more awkward than necessary is that we do not
have a single and easily parsable line about the oom context. This patch
is reorganizing the oom report to

1) who invoked oom and what was the allocation request

[ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

2) OOM stack trace

[ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
[ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
[ 515.906821] Call Trace:
[ 515.908062] dump_stack+0x5a/0x73
[ 515.909311] dump_header+0x55/0x28c
[ 515.914260] oom_kill_process+0x2d8/0x300
[ 515.916708] out_of_memory+0x145/0x4a0
[ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16
[ 515.919157] __alloc_pages_nodemask+0x277/0x290
[ 515.920367] filemap_fault+0x3d0/0x6c0
[ 515.921529] ? filemap_map_pages+0x2b8/0x420
[ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4]
[ 515.923884] __do_fault+0x20/0x80
[ 515.925032] __handle_mm_fault+0xbc0/0xe80
[ 515.926195] handle_mm_fault+0xfa/0x210
[ 515.927357] __do_page_fault+0x233/0x4c0
[ 515.928506] do_page_fault+0x32/0x140
[ 515.929646] ? page_fault+0x8/0x30
[ 515.930770] page_fault+0x1e/0x30

3) OOM memory information

[ 515.958093] Mem-Info:
[ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
active_file:4402672 inactive_file:483963 isolated_file:1344
unevictable:0 dirty:4886753 writeback:0 unstable:0
slab_reclaimable:148442 slab_unreclaimable:18741
mapped:1347 shmem:1347 pagetables:58669 bounce:0
free:88663 free_pcp:0 free_cma:0
...

4) current memory state of all system tasks

[ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal
[ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad
[ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd
[ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd
[ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd
[ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance
[ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd
[ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager
[ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned
[ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic
[ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic
[ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic
[ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic
[ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic

5) oom context (contrains and the chosen victim).

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

An admin can easily get the full oom context at a single line which
makes parsing much easier.

Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5cb113f 28-Dec-2018 Alexey Dobriyan <adobriyan@gmail.com>

mm: make free_reserved_area() return "const char *"

and propagate through down the call stack.

Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c999fbd3 28-Dec-2018 Alexey Dobriyan <adobriyan@gmail.com>

mm/mmzone.c: make "migratetype_names" const char *

Those strings are immutable in fact.

Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c30844d 28-Dec-2018 Mel Gorman <mgorman@techsingularity.net>

mm: reclaim small amounts of memory when an external fragmentation event occurs

An external fragmentation event was previously described as

When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
events occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark and kswapd is woken if the
calling context allows to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost is
cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9: 804694
4.20-rc3+patch: 408912 (49% reduction)
4.20-rc3+patch1-4: 18421 (98% reduction)

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%)
Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%*

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by
this path whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 291392
4.20-rc3+patch: 191187 (34% reduction)
4.20-rc3+patch1-4: 13464 (95% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%)
Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%)
Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%)
Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%*

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 215698
4.20-rc3+patch: 200210 (7% reduction)
4.20-rc3+patch1-4: 14263 (93% reduction)

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%)
Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%)

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch: 147463 (11% reduction)
4.20-rc3+patch1-4: 11095 (93% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%*
Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%)

4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%)

There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a79cdad 28-Dec-2018 Mel Gorman <mgorman@techsingularity.net>

mm: use alloc_flags to record if kswapd can wake

This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
into alloc_flags. This is a preparation patch only that avoids having to
pass gfp_mask through a long callchain in a future patch.

Note that the setting in the fast path happens in alloc_flags_nofragment()
and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
That's true in this patch but is not true later so it's done now for
easier review to show where the flag needs to be recorded.

No functional change.

[mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9214443 28-Dec-2018 Mel Gorman <mgorman@techsingularity.net>

mm: move zone watermark accesses behind an accessor

This is a preparation patch only, no functional change.

Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bb15450 28-Dec-2018 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: spread allocations across zones before introducing fragmentation

Patch series "Fragmentation avoidance improvements", v5.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameterr create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll fault back in old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to exact system state and whether allocations succeeded
so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
instead of fragmenting a higher zone. It's special cased to
handle a Normal->DMA32 fallback with the reasons explained
in the changelog.

Patch 2-4 boosts watermarks temporarily when an external fragmentation
event occurs. kswapd wakes to reclaim a small amount of old memory
and then wakes kcompactd on completion to recover the system
slightly. This introduces some overhead in the slowpath. The level
of boosting can be tuned or disabled depending on the tolerance
for fragmentation vs allocation latency.

Patch 5 stalls some movable allocation requests to let kswapd from patch 4
make some progress. The duration of the stalls is very low but it
is possible to tune the system to avoid fragmentation events if
larger stalls can be tolerated.

The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.

This patch (of 5):

The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll refault old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup the test files.

Note that due to the use of IO and page cache that this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation that it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metric is allocation
latency and huge page allocation success rates but note that differences
in latencies and what the success rate also can affect the number of
external fragmentation event which is why it's a secondary metric.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9: 804694
4.20-rc3+patch: 408912 (49% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-1 662.92 ( 0.00%) 653.58 * 1.41%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the changes of external fragmentation being a problem in
the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
3816 MIGRATE_UNMOVABLE
800845 MIGRATE_MOVABLE
33 MIGRATE_UNRECLAIMABLE

patch
735 MIGRATE_UNMOVABLE
408135 MIGRATE_MOVABLE
42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be presented
again as the primary source of fallbacks are movable allocations.

Movable fallbacks are sometimes considered "ok" to fallback because they
can be migrated. The problem is that they can fill an
unmovable/reclaimable pageblock causing those allocations to fallback
later and polluting pageblocks with pages that cannot move. If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause a unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them. This point is also consistent
throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 291392
4.20-rc3+patch: 191187 (34% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-1 1495.14 ( 0.00%) 1467.55 ( 1.85%)
Amean fault-huge-1 1098.48 ( 0.00%) 1127.11 ( -2.61%)

thpfioscale Percentage Faults Huge
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-1 78.57 ( 0.00%) 77.64 ( -1.18%)

Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 215698
4.20-rc3+patch: 200210 (7% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-5 1350.05 ( 0.00%) 1346.45 ( 0.27%)
Amean fault-huge-5 4181.01 ( 0.00%) 3418.60 ( 18.24%)

4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-5 1.15 ( 0.00%) 0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch: 147463 (11% reduction)

thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-5 6138.97 ( 0.00%) 6217.43 ( -1.28%)
Amean fault-huge-5 2294.28 ( 0.00%) 3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-5 96.82 ( 0.00%) 95.14 ( -1.74%)

There was a slight reduction in external fragmentation events although the
latencies were higher. The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity. There is also a certain degree of luck on whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing
events so the success of THP over long periods of time would be improved
for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 742aa7fb 28-Dec-2018 Aaron Lu <aaron.lu@intel.com>

mm/page_alloc.c: use a single function to free page

There are multiple places of freeing a page, they all do the same things
so a common function can be used to reduce code duplicate.

It also avoids bug fixed in one function but left in another.

Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Pawel Staszewski <pstaszewski@itcare.pl>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65895b67 28-Dec-2018 Aaron Lu <aaron.lu@intel.com>

mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()

page_frag_free() calls __free_pages_ok() to free the page back to Buddy.
This is OK for high order page, but for order-0 pages, it misses the
optimization opportunity of using Per-Cpu-Pages and can cause zone lock
contention when called frequently.

Pawel Staszewski recently shared his result of 'how Linux kernel handles
normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
lock contention comes from page allocator:

mlx5e_poll_tx_cq
|
--16.34%--napi_consume_skb
|
|--12.65%--__free_pages_ok
| |
| --11.86%--free_one_page
| |
| |--10.10%--queued_spin_lock_slowpath
| |
| --0.65%--_raw_spin_lock
|
|--1.55%--page_frag_free
|
--1.44%--skb_release_data

Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
not effective in this workload and pages have to go through the page
allocator. The lock contention happens during mlx5 DMA TX completion
cycle. And the page allocator cannot keep up at these speeds.[2]

I thought that __free_pages_ok() are mostly freeing high order pages and
thought this is an lock contention for high order pages but Jesper
explained in detail that __free_pages_ok() here are actually freeing
order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
allocation request.[3]

The free path as pointed out by Jesper is:
skb_free_head()
-> skb_free_frag()
-> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() - send
the being freed page to PCP if it's an order-0 page, or directly to Buddy
if it is a high order page.

With this change, Paweł hasn't noticed lock contention yet in his
workload and Jesper has noticed a 7% performance improvement using a micro
benchmark and lock contention is gone. Ilias' test on a 'low' speed 1Gbit
interface on an cortex-a53 shows ~11% performance boost testing with
64byte packets and __free_pages_ok() disappeared from perf top.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ead3342 28-Dec-2018 Huang Shijie <sjhuang@iluvatar.ai>

mm/page_alloc.c: change the order of MIGRATE_RECLAIMABLE/MIGRATE_MOVABLE in fallbacks

In the enum migratetype definition, MIGRATE_MOVABLE is before
MIGRATE_RECLAIMABLE. Change the order of them to match the enumeration's
order.

Link: http://lkml.kernel.org/r/20181121085821.3442-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 476567e8 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: remove managed_page_count_lock spinlock

Now that totalram_pages and managed_pages are atomic varibles, no need of
managed_page_count spinlock. The lock had really a weak consistency
guarantee. It hasn't been used for anything but the update but no reader
actually cares about all the values being updated to be in sync.

Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca79b0c2 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: convert totalram_pages and totalhigh_pages variables to atomic

totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9705bea5 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: convert zone->managed_pages to atomic variable

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

This patch converts zone->managed_pages. Subsequent patches will convert
totalram_panges, totalhigh_pages and eventually managed_page_count_lock
will be removed.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d6357de 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: reference totalram_pages and managed_pages once per function

Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fecd4a50 28-Dec-2018 Wei Yang <richard.weiyang@gmail.com>

mm: remove reset of pcp->counter in pageset_init()

per_cpu_pageset is cleared by memset, it is not necessary to reset it
again.

Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d381c547 28-Dec-2018 Michal Hocko <mhocko@suse.com>

mm: only report isolation failures when offlining memory

Heiko has complained that his log is swamped by warnings from
has_unmovable_pages

[ 20.536664] page dumped because: has_unmovable_pages
[ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
[ 20.536794] flags: 0x3fffe0000010200(slab|head)
[ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
[ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
[ 20.536797] page dumped because: has_unmovable_pages
[ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
[ 20.536815] flags: 0x7fffe0000000000()
[ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
[ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

which are not triggered by the memory hotplug but rather CMA allocator.
The original idea behind dumping the page state for all call paths was
that these messages will be helpful debugging failures. From the above it
seems that this is not the case for the CMA path because we are lacking
much more context. E.g the second reported page might be a CMA allocated
page. It is still interesting to see a slab page in the CMA area but it
is hard to tell whether this is bug from the above output alone.

Address this issue by dumping the page state only on request. Both
start_isolate_page_range and has_unmovable_pages already have an argument
to ignore hwpoison pages so make this argument more generic and turn it
into flags and allow callers to combine non-default modes into a mask.
While we are at it, has_unmovable_pages call from
is_pageblock_removable_nolock (sysfs removable file) is questionable to
report the failure so drop it from there as well.

Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2932c8b0 28-Dec-2018 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: be more verbose for memory offline failures

There is only very limited information printed when the memory offlining
fails:

[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

This tells us that the failure is triggered by the userspace intervention
but it doesn't tell us much more about the underlying reason. It might be
that the page migration failes repeatedly and the userspace timeout
expires and send a signal or it might be some of the earlier steps
(isolation, memory notifier) takes too long.

If the migration failes then it would be really helpful to see which page
that and its state. The same applies to the isolation phase. If we fail
to isolate a page from the allocator then knowing the state of the page
would be helpful as well.

Dump the page state that fails to get isolated or migrated. This will
tell us more about the failure and what to focus on during debugging.

[akpm@linux-foundation.org: add missing printk arg]
[mhocko@suse.com: tweak dump_page() `reason' text]
Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2813b9c0 28-Dec-2018 Andrey Konovalov <andreyknvl@google.com>

kasan, mm, arm64: tag non slab memory allocated via pagealloc

Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff. When page_address is used to get pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.

For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed. For non slab pages however, we can recover the tag
in page_address, since the whole page was marked with the same tag.

This patch adds tagging to non slab memory allocated with pagealloc. To
set the tag of the pointer returned from page_address, the tag gets stored
to page->flags when the memory gets allocated.

Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17e2e7d7 21-Dec-2018 Oscar Salvador <osalvador@suse.de>

mm, page_alloc: fix has_unmovable_pages for HugePages

While playing with gigantic hugepages and memory_hotplug, I triggered
the following #PF when "cat memoryX/removable":

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:has_unmovable_pages+0x154/0x210
Call Trace:
is_mem_section_removable+0x7d/0x100
removable_show+0x90/0xb0
dev_attr_show+0x1c/0x50
sysfs_kf_seq_show+0xca/0x1b0
seq_read+0x133/0x380
__vfs_read+0x26/0x180
vfs_read+0x89/0x140
ksys_read+0x42/0x90
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is we do not pass the Head to page_hstate(), and so, the call
to compound_order() in page_hstate() returns 0, so we end up checking
all hstates's size to match PAGE_SIZE.

Obviously, we do not find any hstate matching that size, and we return
NULL. Then, we dereference that NULL pointer in
hugepage_migration_supported() and we got the #PF from above.

Fix that by getting the head page before calling page_hstate().

Also, since gigantic pages span several pageblocks, re-adjust the logic
for skipping pages. While are it, we can also get rid of the
round_up().

[osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2830bf6f 21-Dec-2018 Mikhail Zaslonko <zaslonko@linux.ibm.com>

mm, memory_hotplug: initialize struct pages for the full memory section

If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized. This may lead
to VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered
by memory_hotplug sysfs handlers:

Here are the the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y

kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( test_pages_in_a_zone+0xde/0x160)
show_valid_zones+0x5c/0x190
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops

kernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( is_mem_section_removable+0xb4/0x190)
show_mem_removable+0x9a/0xd8
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.

Michal said:

: This has alwways been problem AFAIU. It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8d9 kernels mostly and
: therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
: allocation in vmemmap")

Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f416836 30-Nov-2018 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: fix calculation of pgdat->nr_zones

init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
'zone_idx(zone) + 1' unconditionally. This is correct in the normal
case, while not exact in hot-plug situation.

This function is used in two places:

* free_area_init_core()
* move_pfn_range_to_zone()

In the first case, we are sure zone index increase monotonically. While
in the second one, this is under users control.

One way to reproduce this is:
----------------------------

1. create a virtual machine with empty node1

-m 4G,slots=32,maxmem=32G \
-smp 4,maxcpus=8 \
-numa node,nodeid=0,mem=4G,cpus=0-3 \
-numa node,nodeid=1,mem=0G,cpus=4-7

2. hot-add cpu 3-7

cpu-add [3-7]

2. hot-add memory to nod1

object_add memory-backend-ram,id=ram0,size=1G
device_add pc-dimm,id=dimm0,memdev=ram0,node=1

3. online memory with following order

echo online_movable > memory47/state
echo online > memory40/state

After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
instead of (ZONE_MOVABLE + 1).

Michal said:
"Having an incorrect nr_zones might result in all sorts of problems
which would be quite hard to debug (e.g. reclaim not considering the
movable zone). I do not expect many users would suffer from this it
but still this is trivial and obviously right thing to do so
backporting to the stable tree shouldn't be harmful (last famous
words)"

Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c63ae43b 16-Nov-2018 Michal Hocko <mhocko@suse.com>

mm, page_alloc: check for max order in hot path

Konstantin has noticed that kvmalloc might trigger the following
warning:

WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report

UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d789999 16-Nov-2018 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: check zone_movable in has_unmovable_pages

Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state shows

has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).

Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit 15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.

Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e1c4e27 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

memblock: stop using implicit alignment to SMP_CACHE_BYTES

When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.

Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.

Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.

For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.

The direct memblock APIs users were updated using the semantic patch below:

@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)

[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57c8a661 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

mm: remove include/linux/bootmem.h

Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c2ee349 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

memblock: rename __free_pages_bootmem to memblock_free_pages

The conversion is done using

sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
$(git grep -l __free_pages_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6ffc5ca 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

memblock: rename free_all_bootmem to memblock_free_all

The conversion is done using

sed -i 's@free_all_bootmem@memblock_free_all@' \
$(git grep -l free_all_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb31d559 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

memblock: remove _virt from APIs returning virtual address

The conversion is done using

sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
$(git grep -l memblock_virt_alloc)

Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aca52c39 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

mm: remove CONFIG_HAVE_MEMBLOCK

All architecures use memblock for early memory management. There is no need
for the CONFIG_HAVE_MEMBLOCK configuration option.

[rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
[rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
[rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec393a0f 26-Oct-2018 Pavel Tatashin <pavel.tatashin@microsoft.com>

mm: return zero_resv_unavail optimization

When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that pfns within pageblock_nr_pages ranges are valid, only the
first one needs to be checked. This is because memory for pages are
allocated in contiguous chunks that contain pageblock_nr_pages struct
pages.

Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 907ec5fc 26-Oct-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: zero remaining unavailable struct pages

Patch series "mm: Fix for movable_node boot option", v3.

This patch series contains a fix for the movable_node boot option issue
which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved").

The commit breaks the option because it changed the memory gap range to
reserved memblock. So, the node is marked as Normal zone even if the SRAT
has Hot pluggable affinity.

First and second patch fix the original issue which the commit tried to
fix, then revert the commit.

This patch (of 3):

There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 which changes how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
default (no memmap= given) memblock layout is like below:

MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory. So some of struct pages in the gap
range are left uninitialized.

We have a function zero_resv_unavail() which does zeroing the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable range (i.e. memblock.memory && !memblock.reserved). This
patch extends it to cover all unavailable range, which fixes the reported
issue.

Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9a9e77f 26-Oct-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: move mirrored memory specific code outside of memmap_init_zone

memmap_init_zone, is getting complex, because it is called from different
contexts: hotplug, and during boot, and also because it must handle some
architecture quirks. One of them is mirrored memory.

Move the code that decides whether to skip mirrored memory outside of
memmap_init_zone, into a separate function.

[pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3035be4 26-Oct-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: calculate deferred pages after skipping mirrored memory

update_defer_init() should be called only when struct page is about to be
initialized. Because it counts number of initialized struct pages, but
there we may skip struct pages if there is some mirrored memory.

So move, update_defer_init() after checking for mirrored memory.

Also, rename update_defer_init() to defer_init() and reverse the return
boolean to emphasize that this is a boolean function, that tells that the
reset of memmap initialization should be deferred.

Make this function self-contained: do not pass number of already
initialized pages in this zone by using static counters.

I found this bug by reading the code. The effect is that fewer than
expected struct pages are initialized early in boot, and it is possible
that in some corner cases we may fail to boot when mirrored pages are
used. The deferred on demand code should somewhat mitigate this. But
this still brings some inconsistencies compared to when booting without
mirrored pages, so it is better to fix.

[pasha.tatashin@oracle.com: add comment about defer_init's lack of locking]
Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: make defer_init non-inline, __meminit]
Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dfb3ccd0 26-Oct-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: make memmap_init a proper function

memmap_init is sometimes a macro sometimes a function based on
__HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make memmap_init
a weak function instead, and let ia64 redefine it.

Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a222127 26-Oct-2018 David Rientjes <rientjes@google.com>

mm/page_alloc.c: initialize num_movable in move_freepages()

If move_freepages_block() returns 0 because !zone_spans_pfn(),
*num_movable can hold the value from the stack because it does not get
initialized in move_freepages().

Move the initialization to move_freepages_block() to guarantee the value
actually makes sense.

This currently doesn't affect its only caller where num_movable != NULL,
so no bug fix, but just more robust.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 966cf44f 26-Oct-2018 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: defer ZONE_DEVICE page initialization to the point where we init pgmap

The ZONE_DEVICE pages were being initialized in two locations. One was
with the memory_hotplug lock held and another was outside of that lock.
The problem with this is that it was nearly doubling the memory
initialization time. Instead of doing this twice, once while holding a
global lock and once without, I am opting to defer the initialization to
the one outside of the lock. This allows us to avoid serializing the
overhead for memory init and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it
to a poison value. Since this is something that is exposed to the driver
in the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

[alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d483da5b 26-Oct-2018 Alexander Duyck <alexander.h.duyck@linux.intel.com>

mm: create non-atomic version of SetPageReserved for init use

It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.

This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing. On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory. I believe the main advantage
of this is that it allows for more compiler optimization as the __set_bit
operation can be reordered whereas the atomic version cannot.

I tried adding a bit of documentation based on f1dd2cd13c4 ("mm,
memory_hotplug: do not associate hotadded memory to zones until online").

Ideally the reserved flag should be set earlier since there is a brief
window where the page is initialization via __init_single_page and we have
not set the PG_Reserved flag. I'm leaving that for a future patch set as
that will require a more significant refactor.

Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c029a1e 26-Oct-2018 Michal Hocko <mhocko@suse.com>

mm, page_alloc: drop should_suppress_show_mem

should_suppress_show_mem() was introduced to reduce the overhead of
show_mem on large NUMA systems. Things have changed since then though.
Namely c78e93630d15 ("mm: do not walk all of system memory during
show_mem") has reduced the overhead considerably.

Moreover warn_alloc_show_mem clears SHOW_MEM_FILTER_NODES when called from
the IRQ context already so we are not printing per node stats.

Remove should_suppress_show_mem because we are losing potentially
interesting information about allocation failures. We have seen a bug
report where system gets unresponsive under memory pressure and there is
only

kernel: [2032243.696888] qlge 0000:8b:00.1 ql1: Could not get a page chunk, i=8, clean_idx =200 .
kernel: [2032243.710725] swapper/7: page allocation failure: order:1, mode:0x1084120(GFP_ATOMIC|__GFP_COLD|__GFP_COMP)

without an additional information for debugging. It would be great to see
the state of the page allocator at the moment.

Link: http://lkml.kernel.org/r/20180907114334.7088-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb414681 26-Oct-2018 Johannes Weiner <hannes@cmpxchg.org>

psi: pressure stall information for CPU, memory, and IO

When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:

cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b29940c1 26-Oct-2018 Vlastimil Babka <vbabka@suse.cz>

mm: rename and change semantics of nr_indirectly_reclaimable_bytes

The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.

The counter is however still useful for accounting direct page allocations
(i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
and:

- change granularity to pages to be more like other counters; sub-page
allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
again remove the check for not printing "hidden" counters

Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b0e0c0e 26-Oct-2018 Oscar Salvador <osalvador@suse.de>

mm/page_alloc.c: clean up check_for_memory()

check_for_memory() looks a bit confusing. First of all, we have this:

if (N_MEMORY == N_NORMAL_MEMORY)
return;

Checking the ENUM declaration, looks like N_MEMORY canot be equal to
N_NORMAL_MEMORY.

I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other
way around either, so unless I am missing something, this condition will
never evaluate to true. It makes sense to get rid of it.

Moving forward, the operations within the loop look a bit confusing as
well.

We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY in
case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY) and zone <=
ZONE_NORMAL. (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on
!CONFIG_HIGHMEM systems, and that is why we can just go ahead and set
N_HIGH_MEMORY unconditionally)

Although this works, it is a bit subtle.

I think that this could be easier to follow:

First, we should only set N_HIGH_MEMORY in case we have CONFIG_HIGHMEM.
And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL,
without further checking whether we have CONFIG_HIGHMEM or not.

Link: http://lkml.kernel.org/r/20180828210158.4617-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15f570bf 26-Oct-2018 Michal Hocko <mhocko@suse.com>

mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry()

Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and lock up the system if OOM.

The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run
even when they have a rescuer available. Those workers might be essential
for reclaim to make a forward progress, however. If we are too unlucky
all the allocations requests can get stuck waiting for a WQ_MEM_RECLAIM
work item and the system is essentially stuck in an OOM condition without
much hope to move on. Tetsuo has seen the reclaim stuck on
drain_local_pages_wq or xlog_cil_push_work (xfs). There might be others.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally in
order to guarantee that other pending work items are started. This
will workaround this problem and it is less fragile than hunting down
when the sleep is missed. Having a single sleeping point is more
robust.

[akpm@linux-foundation.org: reflow comment to 80 cols to save a couple of lines]
Link: http://lkml.kernel.org/r/20180827135101.15700-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0546375 06-Oct-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

mm, sched/numa: Remove remaining traces of NUMA rate-limiting

Remove the leftover pglist_data::numabalancing_migrate_lock and its
initialization, we stopped using this lock with:

efaffc5e40ae ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")

[ mingo: Rewrote the changelog. ]

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# efaffc5e 01-Oct-2018 Mel Gorman <mgorman@techsingularity.net>

mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration

Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
4.19.0-rc5 4.19.0-rc5
numab-v1r1 noratelimit-v1r1
MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%)
MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%)
MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%)
MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 464c7ffb 04-Sep-2018 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.

When scanning for movable pages, filter out Hugetlb pages if hugepage
migration is not supported. Without this we hit infinte loop in
__offline_pages() where we do

pfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
goto repeat;
}

Fix this by checking hugepage_migration_supported both in
has_unmovable_pages which is the primary backoff mechanism for page
offlining and for consistency reasons also into scan_movable_pages
because it doesn't make any sense to return a pfn to non-migrateable
huge page.

This issue was revealed by, but not caused by 72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early").

Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Haren Myneni <haren@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 13ba17be 24-Aug-2018 Mukesh Ojha <mojha@codeaurora.org>

notifier: Remove notifier header file wherever not used

The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org


# d4ae9916 23-Aug-2018 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: soft-offline: close the race against page allocation

A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline. This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."

Two main changes of this patch are:

- setting migrate type of the target page to MIGRATE_ISOLATE. As done
in free_unref_page_commit(), this makes kernel bypass pcplist when
freeing the page. So we can assume that the page is in freelist just
after put_page() returns,

- setting PG_hwpoison on free page under zone->lock which protects
freelists, so this allows us to avoid setting PG_hwpoison on a page
that is decided to be allocated soon.

[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03e85f9d 21-Aug-2018 Oscar Salvador <osalvador@suse.de>

mm/page_alloc: Introduce free_area_init_core_hotplug

Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.

free_area_init_core() performs the following actions:

1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone

When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.

Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages. For the same reason, action #3 results
always in manages_pages being 0.

Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things:

First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:

1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals().

The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():

Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.

Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.

Actually, we re-set these values to 0 later on with the calls to:

reset_node_managed_pages()
reset_node_present_pages()

The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.

[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0188dc98 21-Aug-2018 Oscar Salvador <osalvador@suse.de>

mm/page_alloc: inline function to handle CONFIG_DEFERRED_STRUCT_PAGE_INIT

Let us move the code between CONFIG_DEFERRED_STRUCT_PAGE_INIT to an inline
function. Not having an ifdef in the function makes the code more
readable.

Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cc2a959 21-Aug-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: remove __paginginit

__paginginit is the same thing as __meminit except for platforms without
sparsemem, there it is defined as __init.

Remove __paginginit and use __meminit. Use __ref in one single function
that merges __meminit and __init sections: setup_usemap().

Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1093b74 21-Aug-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: access zone->node via zone_to_nid() and zone_set_nid()

zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdef's in c
files.

Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ace1db39 21-Aug-2018 Oscar Salvador <osalvador@suse.de>

mm/page_alloc.c: move ifdefery out of free_area_init_core

Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.

This patchset does three things:

1) Clean up/refactor free_area_init_core/free_area_init_node
by moving the ifdefery out of the functions.
2) Move the pgdat/zone initialization in free_area_init_core to its
own function.
3) Introduce free_area_init_core_hotplug, a small subset of
free_area_init_core, which is only called from memhotlug code path. In this
way, we have:

free_area_init_core: called during early initialization
free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

This patch (of 5):

Moving the #ifdefs out of the function makes it easier to follow.

Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8a759b5 17-Aug-2018 Aaron Lu <aaron.lu@intel.com>

mm, page_alloc: double zone's batchsize

To improve page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed,
PCP will be checked first before asking pages from Buddy. When PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance and the size of batch can affect performance.

zone's batch size gets doubled last time by commit ba56e91c9401("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago. Since
then, CPU has envolved a lot and CPU's cache sizes also increased.

Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested me to do two things: first, use a page
allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size work with other workloads.

In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads and decides a new default according
to their results.

Most test results are flat. will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop. Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop. This performance drop
might be mitigated by others' work on optimizing LRU lock.

Another downside of increasing pcp->batch is, when PCP is used up and need
to fetch a batch of pages from Buddy, since batch is increased, that time
can be longer than before. My understanding is, this doesn't affect
slowpath where direct reclaim and compaction dominates. For fastpath,
throughput is a win(according to will-it-scale/page_fault1) but worst
latency can be larger now.

Overall, I think double the batch size from 31 to 63 is relatively safe
and provide good performance boost for high-core-count systems.

The two phase's test results are listed below(all tests are done with THP
disabled).

Phase one(will-it-scale/page_fault1) test results:

Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.

batch score change zone_contention lru_contention total_contention
31 15345900 +0.00% 64% 8% 72%
53 17903847 +16.67% 32% 38% 70%
63 17992886 +17.25% 24% 45% 69%
73 18022825 +17.44% 10% 61% 71%
119 18023401 +17.45% 4% 66% 70%
127 18029012 +17.48% 3% 66% 69%
137 18036075 +17.53% 4% 66% 70%
165 18035964 +17.53% 2% 67% 69%
188 18101105 +17.95% 2% 67% 69%
223 18130951 +18.15% 2% 67% 69%
255 18118898 +18.07% 2% 67% 69%
267 18101559 +17.96% 2% 67% 69%
299 18160468 +18.34% 2% 68% 70%
320 18139845 +18.21% 2% 67% 69%
393 18160869 +18.34% 2% 68% 70%
424 18170999 +18.41% 2% 68% 70%
458 18144868 +18.24% 2% 68% 70%
467 18142366 +18.22% 2% 68% 70%
498 18154549 +18.30% 1% 68% 69%
511 18134525 +18.17% 1% 69% 70%

Broadwell-EX: similar pattern as Skylake-EX.

batch score change zone_contention lru_contention total_contention
31 16703983 +0.00% 67% 7% 74%
53 18195393 +8.93% 43% 28% 71%
63 18288885 +9.49% 38% 33% 71%
73 18344329 +9.82% 35% 37% 72%
119 18535529 +10.96% 24% 46% 70%
127 18513596 +10.83% 23% 48% 71%
137 18514327 +10.84% 23% 48% 71%
165 18511840 +10.82% 22% 49% 71%
188 18593478 +11.31% 17% 53% 70%
223 18601667 +11.36% 17% 52% 69%
255 18774825 +12.40% 12% 58% 70%
267 18754781 +12.28% 9% 60% 69%
299 18892265 +13.10% 7% 63% 70%
320 18873812 +12.99% 8% 62% 70%
393 18891174 +13.09% 6% 64% 70%
424 18975108 +13.60% 6% 64% 70%
458 18932364 +13.34% 8% 62% 70%
467 18960891 +13.51% 5% 65% 70%
498 18944526 +13.41% 5% 64% 69%
511 18960839 +13.51% 5% 64% 69%

Skylake-EP: although increased batch reduced zone->lock contention, but
the effect is not as good as EX: zone->lock contention is still as high as
20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX. Also, total_contention actually decreased with a higher
batch but that doesn't translate to performance increase.

batch score change zone_contention lru_contention total_contention
31 9554867 +0.00% 66% 3% 69%
53 9855486 +3.15% 63% 3% 66%
63 9980145 +4.45% 62% 4% 66%
73 10092774 +5.63% 62% 5% 67%
119 10310061 +7.90% 45% 19% 64%
127 10342019 +8.24% 42% 19% 61%
137 10358182 +8.41% 42% 21% 63%
165 10397060 +8.81% 37% 24% 61%
188 10341808 +8.24% 34% 26% 60%
223 10349135 +8.31% 31% 27% 58%
255 10327189 +8.08% 28% 29% 57%
267 10344204 +8.26% 27% 29% 56%
299 10325043 +8.06% 25% 30% 55%
320 10310325 +7.91% 25% 31% 56%
393 10293274 +7.73% 21% 31% 52%
424 10311099 +7.91% 21% 32% 53%
458 10321375 +8.02% 21% 32% 53%
467 10303881 +7.84% 21% 32% 53%
498 10332462 +8.14% 20% 33% 53%
511 10325016 +8.06% 20% 32% 52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.

batch score change zone_contention lru_contention total_contention
31 10121178 +0.00% 19% 50% 69%
53 10142366 +0.21% 6% 63% 69%
63 10117984 -0.03% 11% 58% 69%
73 10123330 +0.02% 7% 63% 70%
119 10108791 -0.12% 2% 67% 69%
127 10166074 +0.44% 3% 66% 69%
137 10141574 +0.20% 3% 66% 69%
165 10154499 +0.33% 2% 68% 70%
188 10124921 +0.04% 2% 67% 69%
223 10137399 +0.16% 2% 67% 69%
255 10143289 +0.22% 0% 68% 68%
267 10123535 +0.02% 1% 68% 69%
299 10140952 +0.20% 0% 68% 68%
320 10163170 +0.41% 0% 68% 68%
393 10000633 -1.19% 0% 69% 69%
424 10087998 -0.33% 0% 69% 69%
458 10187116 +0.65% 0% 69% 69%
467 10146790 +0.25% 0% 69% 69%
498 10197958 +0.76% 0% 69% 69%
511 10152326 +0.31% 0% 69% 69%

Haswell-EP: similar to Broadwell-EP.

batch score change zone_contention lru_contention total_contention
31 10442205 +0.00% 14% 48% 62%
53 10442255 +0.00% 5% 57% 62%
63 10452059 +0.09% 6% 57% 63%
73 10482349 +0.38% 5% 59% 64%
119 10454644 +0.12% 3% 60% 63%
127 10431514 -0.10% 3% 59% 62%
137 10423785 -0.18% 3% 60% 63%
165 10481216 +0.37% 2% 61% 63%
188 10448755 +0.06% 2% 61% 63%
223 10467144 +0.24% 2% 61% 63%
255 10480215 +0.36% 2% 61% 63%
267 10484279 +0.40% 2% 61% 63%
299 10466450 +0.23% 2% 61% 63%
320 10452578 +0.10% 2% 61% 63%
393 10499678 +0.55% 1% 62% 63%
424 10481454 +0.38% 1% 62% 63%
458 10473562 +0.30% 1% 62% 63%
467 10484269 +0.40% 0% 62% 62%
498 10505599 +0.61% 0% 62% 62%
511 10483395 +0.39% 0% 62% 62%

Westmere-EP: contention is pretty small so not interesting. Note too high
a batch value could hurt performance.

batch score change zone_contention lru_contention total_contention
31 4831523 +0.00% 2% 3% 5%
53 4834086 +0.05% 2% 4% 6%
63 4834262 +0.06% 2% 3% 5%
73 4832851 +0.03% 2% 4% 6%
119 4830534 -0.02% 1% 3% 4%
127 4827461 -0.08% 1% 4% 5%
137 4827459 -0.08% 1% 3% 4%
165 4820534 -0.23% 0% 4% 4%
188 4817947 -0.28% 0% 3% 3%
223 4809671 -0.45% 0% 3% 3%
255 4802463 -0.60% 0% 4% 4%
267 4801634 -0.62% 0% 3% 3%
299 4798047 -0.69% 0% 3% 3%
320 4793084 -0.80% 0% 3% 3%
393 4785877 -0.94% 0% 3% 3%
424 4782911 -1.01% 0% 3% 3%
458 4779346 -1.08% 0% 3% 3%
467 4780306 -1.06% 0% 3% 3%
498 4780589 -1.05% 0% 3% 3%
511 4773724 -1.20% 0% 3% 3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch score change zone_contention lru_contention total_contention
31 3906608 +0.00% 2% 3% 5%
53 3940164 +0.86% 2% 3% 5%
63 3937289 +0.79% 2% 3% 5%
73 3940201 +0.86% 2% 3% 5%
119 3933240 +0.68% 2% 3% 5%
127 3930514 +0.61% 2% 4% 6%
137 3938639 +0.82% 0% 3% 3%
165 3908755 +0.05% 0% 3% 3%
188 3905621 -0.03% 0% 3% 3%
223 3903015 -0.09% 0% 4% 4%
255 3889480 -0.44% 0% 3% 3%
267 3891669 -0.38% 0% 4% 4%
299 3898728 -0.20% 0% 4% 4%
320 3894547 -0.31% 0% 4% 4%
393 3875137 -0.81% 0% 4% 4%
424 3874521 -0.82% 0% 3% 3%
458 3880432 -0.67% 0% 4% 4%
467 3888715 -0.46% 0% 3% 3%
498 3888633 -0.46% 0% 4% 4%
511 3875305 -0.80% 0% 5% 5%

Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
contention is higher than other desktops.

batch score change zone_contention lru_contention total_contention
31 3511158 +0.00% 2% 5% 7%
53 3555445 +1.26% 2% 6% 8%
63 3561082 +1.42% 2% 6% 8%
73 3547218 +1.03% 2% 6% 8%
119 3571319 +1.71% 1% 7% 8%
127 3549375 +1.09% 0% 6% 6%
137 3560233 +1.40% 0% 6% 6%
165 3555176 +1.25% 2% 6% 8%
188 3551501 +1.15% 0% 8% 8%
223 3531462 +0.58% 0% 7% 7%
255 3570400 +1.69% 0% 7% 7%
267 3532235 +0.60% 1% 8% 9%
299 3562326 +1.46% 0% 6% 6%
320 3553569 +1.21% 0% 8% 8%
393 3539519 +0.81% 0% 7% 7%
424 3549271 +1.09% 0% 8% 8%
458 3528885 +0.50% 0% 8% 8%
467 3526554 +0.44% 0% 7% 7%
498 3525302 +0.40% 0% 9% 9%
511 3527556 +0.47% 0% 8% 8%

Sandybridge-Desktop: the 0% contention isn't accurate but caused by
dropped fractional part. Since multiple contention path's contentions
are all under 1% here, with some arithmetic operations like add, the
final deviation could be as large as 3%.

batch score change zone_contention lru_contention total_contention
31 1744495 +0.00% 0% 0% 0%
53 1755341 +0.62% 0% 0% 0%
63 1758469 +0.80% 0% 0% 0%
73 1759626 +0.87% 0% 0% 0%
119 1770417 +1.49% 0% 0% 0%
127 1768252 +1.36% 0% 0% 0%
137 1767848 +1.34% 0% 0% 0%
165 1765088 +1.18% 0% 0% 0%
188 1766918 +1.29% 0% 0% 0%
223 1767866 +1.34% 0% 0% 0%
255 1768074 +1.35% 0% 0% 0%
267 1763187 +1.07% 0% 0% 0%
299 1765620 +1.21% 0% 0% 0%
320 1767603 +1.32% 0% 0% 0%
393 1764612 +1.15% 0% 0% 0%
424 1758476 +0.80% 0% 0% 0%
458 1758593 +0.81% 0% 0% 0%
467 1757915 +0.77% 0% 0% 0%
498 1753363 +0.51% 0% 0% 0%
511 1755548 +0.63% 0% 0% 0%

Phase two test results:
Note: all percent change is against base(batch=31).

ebizzy.throughput (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0%
lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1%
lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6%
lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1%
lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4%
lkp-skl-d01 264126 262791 -0.5% 264113 +0.0%
lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1%
lkp-sb02 98937 98937 +0.0% 99030 +0.1%

kbuild.buildtime (less is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1%
lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1%
lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1%
lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4%
lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1%
lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7%
lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0%
lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1%
lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1%
lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1%
lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1%
lkp-skl-d01 77314 77303 -0.0% 77375 +0.1%
lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7%
lkp-sb02 29990 30137 +0.5% 30103 +0.4%

oltp.transactions (higer is better)

machine batch=31 batch=63 batch=127
lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4%
lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0%
lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7%
lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4%
lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0%
lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0%
lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8%

pigz.throughput (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2%
lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8%
lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0%
lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7%
lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3%
lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1%
lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0%
lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0%

will-it-scale.malloc1.processes (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0%
lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0%
lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3%
lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0%
lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6%
lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0%
lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2%
lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6%
lkp-sb02 589286 589284 -0.0% 588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2%
lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4%
lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9%
lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9%
lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2%
lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1%
lkp-skl-d01 175277 173343 -1.1% 173082 -1.3%
lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8%
lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4%

will-it-scale.malloc2.processes (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0%
lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0%
lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1%
lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1%
lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1%
lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1%
lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0%
lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0%
lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0%

will-it-scale.malloc2.threads (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0%
lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0%
lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2%
lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3%
lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0%
lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0%
lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0%
lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0%
lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0%

will-it-scale.page_fault2.processes (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0%
lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0%
lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3%
lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9%
lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7%
lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5%
lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9%
lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6%
lkp-sb02 842616 837844 -0.6% 833921 -1.0%

will-it-scale.page_fault2.threads

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1%
lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9%
lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0%
lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8%
lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3%
lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9%
lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4%
lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5%
lkp-sb02 750626 748615 -0.3% 746621 -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2%
lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2%
lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0%
lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3%
lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3%
lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9%
lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5%
lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5%
lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4%

will-it-scale.page_fault3.threads (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1%
lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8%
lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9%
lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6%
lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5%
lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9%
lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9%
lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9%
lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6%

will-it-scale.read1.processes (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable)
lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2%
lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1%
lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6%
lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2%
lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2%
lkp-skl-d01 10969487 10972650 +0.0% no data
lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8%
lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5%

will-it-scale.read1.threads (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable)
lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3%
lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6%
lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3%
lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1%
lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7%
lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1%
lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1%
lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0%

will-it-scale.write1.processes (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0%
lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4%
lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1%
lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9%
lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7%
lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9%
lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8%
lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4%
lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6%

will-it-scale.write1.threads (higer is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7%
lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6%
lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7%
lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1%
lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8%
lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8%
lkp-skl-d01 8346174 8235695 -1.3% no data
lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3%
lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1%
lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9%
lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4%
lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7%
lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5%
lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2%
lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8%
lkp-sb02 570572 567854 -0.5% 568449 -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1%
lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9%
lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6%
lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9%
lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0%
lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0%
lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1%
lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9%
lkp-sb02 567137 567346 +0.0% 566144 -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7%
lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9%
lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5%
lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1%
lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7%
lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3%
lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9%
lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4%
lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6%
lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2%
lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0%
lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1%
lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1%
lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1%
lkp-skl-d01 696689 695537 -0.2% 694106 -0.4%
lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2%
lkp-sb02 258939 258787 -0.1% 258199 -0.3%

Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ea9a680 17-Aug-2018 Michal Hocko <mhocko@suse.com>

mm: drop VM_BUG_ON from __get_free_pages

There is no real reason to blow up just because the caller doesn't know
that __get_free_pages cannot return highmem pages. Simply fix that up
silently. Even if we have some confused users such a fixup will not be
harmful.

[akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiankang Chen <chenjiankang1@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d6a24df0 17-Aug-2018 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: actually ignore mempolicies for high priority allocations

__alloc_pages_slowpath() has for a long time contained code to ignore
node restrictions from memory policies for high priority allocations.
The current code that resets the zonelist iterator however does
effectively nothing after commit 7810e6781e0f ("mm, page_alloc: do not
break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
Even before that commit, mempolicy restrictions were still not ignored,
as they are passed in ac->nodemask which is untouched by the code.

We can either remove the code, or make it work as intended. Since
ac->nodemask can be set from task's mempolicy via alloc_pages_current()
and thus also alloc_pages(), it may indeed affect kernel allocations,
and it makes sense to ignore it to allow progress for high priority
allocations.

Thus, this patch resets ac->nodemask to NULL in such cases. This
assumes all callers can handle it (i.e. there are no guarantees as in
the case of __GFP_THISNODE) which seems to be the case. The same
assumption is already present in check_retry_cpuset() for some time.

The expected effect is that high priority kernel allocations in the
context of userspace tasks (e.g. OOM victims) restricted by mempolicies
will have higher chance to succeed if they are restricted to nodes with
depleted memory, while there are other nodes with free memory left.

It's not a new intention, but for the first time the code will match the
intention, AFAICS. It was intended by commit 183f6371aac2 ("mm: ignore
mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
really worked, as mempolicy restriction was already encoded in nodemask,
not zonelist, at that time.

So originally that was for ALLOC_NO_WATERMARK only. Then it was
adjusted by e46e7b77c909 ("mm, page_alloc: recalculate the preferred
zoneref if the context can ignore memory policies") and cd04ae1e2dc8
("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
current state. So even GFP_ATOMIC would now ignore mempolicies after
the initial attempts fail - if the code worked as people thought it
does.

Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 720e14eb 17-Aug-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: skip invalid pages block at a time in zero_resv_unresv()

The role of zero_resv_unavail() is to make sure that every struct page
that is allocated but is not backed by memory that is accessible by
kernel is zeroed and not in some uninitialized state.

Since struct pages are allocated in blocks (2M pages in x86 case), we
can skip pageblock_nr_pages at a time, when the first one is found to be
invalid.

This optimization may help since now on x86 every hole in e820 maps is
marked as reserved in memblock, and thus will go through this function.

This function is called before sched_clock() is initialized, so I used
my x86 early boot clock patches to measure the performance improvement.

With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
but with this optimization 0.001103s.

Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55f2503c 31-Jul-2018 Pingfan Liu <kernelfans@gmail.com>

PM / reboot: Eliminate race between reboot and suspend

At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0d834328 02-Aug-2018 Dave Hansen <dave.hansen@linux.intel.com>

mm: Allow non-direct-map arguments to free_reserved_area()

free_reserved_area() takes pointers as arguments to show which addresses
should be freed. However, it does this in a somewhat ambiguous way. If it
gets a kernel direct map address, it always works. However, if it gets an
address that is part of the kernel image alias mapping, it can fail.

It fails if all of the following happen:
* The specified address is part of the kernel image alias
* Poisoning is requested (forcing a memset())
* The address is in a read-only portion of the kernel image

The memset() fails on the read-only mapping, of course.
free_reserved_area() *is* called both on the direct map and on kernel image
alias addresses. We've just lucked out thus far that the kernel image
alias areas it gets used on are read-write. I'm fairly sure this has been
just a happy accident.

It is quite easy to make free_reserved_area() work for all cases: just
convert the address to a direct map address before doing the memset(), and
do this unconditionally. There is little chance of a regression here
because we previously did a virt_to_page() on the address for the memset,
so we know these are not highmem pages for which virt_to_page() would fail.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com


# d1b47a7c 16-Jul-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: don't do zero_resv_unavail if memmap is not allocated

Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
x86-32.

The cause is that we access struct pages before they are allocated when
CONFIG_FLAT_NODE_MEM_MAP is used.

free_area_init_nodes()
zero_resv_unavail()
mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced
free_area_init_node()
if CONFIG_FLAT_NODE_MEM_MAP
alloc_node_mem_map()
memblock_virt_alloc_node_nopanic() <- struct page alloced here

On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory
that it returns, so we do not need to do zero_resv_unavail() here.

Fixes: e181ae0c5db9 ("mm: zero unavailable pages before memmap init")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Matt Hart <matt@mattface.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e181ae0c 14-Jul-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: zero unavailable pages before memmap init

We must zero struct pages for memory that is not backed by physical
memory, or kernel does not have access to.

Recently, there was a change which zeroed all memmap for all holes in
e820. Unfortunately, it introduced a bug that is discussed here:

https://www.spinics.net/lists/linux-mm/msg156764.html

Linus, also saw this bug on his machine, and confirmed that reverting
commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") fixes the issue.

The problem is that we incorrectly zero some struct pages after they
were setup.

The fix is to zero unavailable struct pages prior to initializing of
struct pages.

A more detailed fix should come later that would avoid double zeroing
cases: one in __init_single_page(), the other one in
zero_resv_unavail().

Fixes: 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0825a6f9 14-Jun-2018 Joe Perches <joe@perches.com>

mm: use octal not symbolic permissions

mm/*.c files use symbolic and octal styles for permissions.

Using octal and not symbolic permissions is preferred by many as more
readable.

https://lkml.org/lkml/2016/8/2/1945

Prefer the direct use of octal for permissions.

Done using
$ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
and some typing.

Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
44
After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
86

Miscellanea:

o Whitespace neatening around these conversions.

Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7810e678 07-Jun-2018 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: do not break __GFP_THISNODE by zonelist reset

In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.

This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.

Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.

We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aac2 ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.

We might consider this for 4.17 although I don't know if there's
anything currently broken.

SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.

So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.

Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4da1984e 07-Jun-2018 Matthew Wilcox <willy@infradead.org>

mm: combine LRU and main union in struct page

This gives us five words of space in a single union in struct page. The
compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
systems, but that does not seem likely to cause any trouble.

Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa3015b7 07-Jun-2018 Matthew Wilcox <willy@infradead.org>

mm: use page->deferred_list

Now that we can represent the location of 'deferred_list' in C instead of
comments, make use of that ability.

Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e292b9b 07-Jun-2018 Matthew Wilcox <willy@infradead.org>

mm: split page_type out from _mapcount

We're already using a union of many fields here, so stop abusing the
_mapcount and make page_type its own field. That implies renaming some of
the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
back the PG_buddy, PG_balloon and PG_kmemcg names.

As suggested by Kirill, make page_type a bitmask. Because it starts out
life as -1 (thanks to sharing the storage with _mapcount), setting a page
flag means clearing the appropriate bit. This gives us space for probably
twenty or so extra bits (depending how paranoid we want to be about
_mapcount underflow).

Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a380b40a 07-Jun-2018 Huaisheng Ye <yehs1@lenovo.com>

mm/page_alloc.c: remove useless parameter of finalise_ac()

finalise_ac() has parameter order which is not used at all. Remove it.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fb52bbae 07-Jun-2018 Mathieu Malaterre <malat@debian.org>

mm: move is_pageblock_removable_nolock() to mm/memory_hotplug.c

is_pageblock_removable_nolock() is not used outside of
mm/memory_hotplug.c. Move it next to unique caller
is_mem_section_removable() and make it static.

Remove prototype in <linux/memory_hotplug.h> to silence gcc warning (W=1):

mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]

Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93781325 07-Jun-2018 Omar Sandoval <osandov@fb.com>

lockdep: fix fs_reclaim annotation

While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.

1: https://lwn.net/Articles/625412/

Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e6943859 07-Jun-2018 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc: remove realsize in free_area_init_core()

Highmem's realsize always equals to freesize, so it is not necessary to
spare a variable to record this.

Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15c30bc0 25-May-2018 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: make has_unmovable_pages more robust

Oscar has reported:
: Due to an unfortunate setting with movablecore, memblocks containing bootmem
: memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
: So while trying to remove that memory, the system failed in do_migrate_range
: and __offline_pages never returned.
:
: This can be reproduced by running
: qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
: and movablecore=4G kernel command line
:
: linux kernel: BIOS-provided physical RAM map:
: linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
: linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
: linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
: linux kernel: NX (Execute Disable) protection: active
: linux kernel: SMBIOS 2.8 present.
: linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
: linux kernel: Hypervisor detected: KVM
: linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
: linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
: linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
:
: linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
: linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
: linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
: linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
: linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
: linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
: linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
:
: zoneinfo shows that the zone movable is placed into both numa nodes:
: Node 0, zone Movable
: pages free 160140
: min 1823
: low 2278
: high 2733
: spanned 262144
: present 262144
: managed 245670
: Node 1, zone Movable
: pages free 448427
: min 3827
: low 4783
: high 5739
: spanned 524288
: present 524288
: managed 515766

Note how only Node 0 has a hutplugable memory region which would rule it
out from the early memblock allocations (most likely memmap). Node1
will surely contain memmaps on the same node and those would prevent
offlining to succeed. So this is arguably a configuration issue.
Although one could argue that we should be more clever and rule early
allocations from the zone movable. This would be correct but probably
not worth the effort considering what a hack movablecore is.

Anyway, We could do better for those cases though. We rely on
start_isolate_page_range resp. has_unmovable_pages to do their job.
The first one isolates the whole range to be offlined so that we do not
allocate from it anymore and the later makes sure we are not stumbling
over non-migrateable pages.

has_unmovable_pages is overly optimistic, however. It doesn't check all
the pages if we are withing zone_movable because we rely that those
pages will be always migrateable. As it turns out we are still not
perfect there. While bootmem pages in zonemovable sound like a clear
bug which should be fixed let's remove the optimization for now and warn
if we encounter unmovable pages in zone_movable in the meantime. That
should help for now at least.

Btw. this wasn't a real problem until commit 72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early") because we used to
have a small number of retries and then failed. This turned out to be
too fragile though.

Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Oscar Salvador <osalvador@techadventures.net>
Tested-by: Oscar Salvador <osalvador@techadventures.net>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d883c6cf 22-May-2018 Joonsoo Kim <iamjoonsoo.kim@lge.com>

Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"

This reverts the following commits that change CMA design in MM.

3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")

bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

Ville reported a following error on i386.

Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
microcode: microcode updated early to revision 0x4, date = 2013-06-28
Initializing CPU#0
Initializing HighMem for node 0 (000377fe:00118000)
Initializing Movable for node 0 (00000001:00118000)
BUG: Bad page state in process swapper pfn:377fe
page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
flags: 0x80000000()
raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
page dumped because: nonzero mapcount
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
Call Trace:
dump_stack+0x60/0x96
bad_page+0x9a/0x100
free_pages_check_bad+0x3f/0x60
free_pcppages_bulk+0x29d/0x5b0
free_unref_page_commit+0x84/0xb0
free_unref_page+0x3e/0x70
__free_pages+0x1d/0x20
free_highmem_page+0x19/0x40
add_highpages_with_active_regions+0xab/0xeb
set_highmem_pages_init+0x66/0x73
mem_init+0x1b/0x1d7
start_kernel+0x17a/0x363
i386_start_kernel+0x95/0x99
startup_32_smp+0x164/0x168

The reason for this error is that the span of MOVABLE_ZONE is extended
to whole node span for future CMA initialization, and, normal memory is
wrongly freed here. I submitted the fix and it seems to work, but,
another problem happened.

It's so late time to fix the later problem so I decide to reverting the
series.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6f84f8d1 10-Apr-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

xen, mm: allow deferred page initialization for xen pv domains

Juergen Gross noticed that commit f7f99100d8d ("mm: stop zeroing memory
during allocation in vmemmap") broke XEN PV domains when deferred struct
page initialization is enabled.

This is because the xen's PagePinned() flag is getting erased from
struct pages when they are initialized later in boot.

Juergen fixed this problem by disabling deferred pages on xen pv
domains. It is desirable, however, to have this feature available as it
reduces boot time. This fix re-enables the feature for pv-dmains, and
fixes the problem the following way:

The fix is to delay setting PagePinned flag until struct pages for all
allocated memory are initialized, i.e. until after free_all_bootmem().

A new x86_init.hyper op init_after_bootmem() is called to let xen know
that boot allocator is done, and hence struct pages for all the
allocated memory are now initialized. If deferred page initialization
is enabled, the rest of struct pages are going to be initialized later
in boot once page_alloc_init_late() is called.

xen_after_bootmem() walks page table's pages and marks them pinned.

Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d47a3ec 10-Apr-2018 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/cma: remove ALLOC_CMA

Now, all reserved pages for CMA region are belong to the ZONE_MOVABLE
and it only serves for a request with GFP_HIGHMEM && GFP_MOVABLE.

Therefore, we don't need to maintain ALLOC_CMA at all.

Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bad8c6c0 10-Apr-2018 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE

Patch series "mm/cma: manage the memory of the CMA area by using the
ZONE_MOVABLE", v2.

0. History

This patchset is the follow-up of the discussion about the "Introduce
ZONE_CMA (v7)" [1]. Please reference it if more information is needed.

1. What does this patch do?

This patch changes the management way for the memory of the CMA area in
the MM subsystem. Currently the memory of the CMA area is managed by
the zone where their pfn is belong to. However, this approach has some
problems since MM subsystem doesn't have enough logic to handle the
situation that different characteristic memories are in a single zone.
To solve this issue, this patch try to manage all the memory of the CMA
area by using the MOVABLE zone. In MM subsystem's point of view,
characteristic of the memory on the MOVABLE zone and the memory of the
CMA area are the same. So, managing the memory of the CMA area by using
the MOVABLE zone will not have any problem.

2. Motivation

There are some problems with current approach. See following. Although
these problem would not be inherent and it could be fixed without this
conception change, it requires many hooks addition in various code path
and it would be intrusive to core MM and would be really error-prone.
Therefore, I try to solve them with this new approach. Anyway,
following is the problems of the current implementation.

o CMA memory utilization

First, following is the freepage calculation logic in MM.

- For movable allocation: freepage = total freepage
- For unmovable allocation: freepage = total freepage - CMA freepage

Freepages on the CMA area is used after the normal freepages in the zone
where the memory of the CMA area is belong to are exhausted. At that
moment that the number of the normal freepages is zero, so

- For movable allocation: freepage = total freepage = CMA freepage
- For unmovable allocation: freepage = 0

If unmovable allocation comes at this moment, allocation request would
fail to pass the watermark check and reclaim is started. After reclaim,
there would exist the normal freepages so freepages on the CMA areas
would not be used.

FYI, there is another attempt [2] trying to solve this problem in lkml.
And, as far as I know, Qualcomm also has out-of-tree solution for this
problem.

Useless reclaim:

There is no logic to distinguish CMA pages in the reclaim path. Hence,
CMA page is reclaimed even if the system just needs the page that can be
usable for the kernel allocation.

Atomic allocation failure:

This is also related to the fallback allocation policy for the memory of
the CMA area. Consider the situation that the number of the normal
freepages is *zero* since the bunch of the movable allocation requests
come. Kswapd would not be woken up due to following freepage
calculation logic.

- For movable allocation: freepage = total freepage = CMA freepage

If atomic unmovable allocation request comes at this moment, it would
fails due to following logic.

- For unmovable allocation: freepage = total freepage - CMA freepage = 0

It was reported by Aneesh [3].

Useless compaction:

Usual high-order allocation request is unmovable allocation request and
it cannot be served from the memory of the CMA area. In compaction,
migration scanner try to migrate the page in the CMA area and make
high-order page there. As mentioned above, it cannot be usable for the
unmovable allocation request so it's just waste.

3. Current approach and new approach

Current approach is that the memory of the CMA area is managed by the
zone where their pfn is belong to. However, these memory should be
distinguishable since they have a strong limitation. So, they are
marked as MIGRATE_CMA in pageblock flag and handled specially. However,
as mentioned in section 2, the MM subsystem doesn't have enough logic to
deal with this special pageblock so many problems raised.

New approach is that the memory of the CMA area is managed by the
MOVABLE zone. MM already have enough logic to deal with special zone
like as HIGHMEM and MOVABLE zone. So, managing the memory of the CMA
area by the MOVABLE zone just naturally work well because constraints
for the memory of the CMA area that the memory should always be
migratable is the same with the constraint for the MOVABLE zone.

There is one side-effect for the usability of the memory of the CMA
area. The use of MOVABLE zone is only allowed for a request with
GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also
only allowed for this gfp flag. Before this patchset, a request with
GFP_MOVABLE can use them. IMO, It would not be a big issue since most
of GFP_MOVABLE request also has GFP_HIGHMEM flag. For example, file
cache page and anonymous page. However, file cache page for blockdev
file is an exception. Request for it has no GFP_HIGHMEM flag. There is
pros and cons on this exception. In my experience, blockdev file cache
pages are one of the top reason that causes cma_alloc() to fail
temporarily. So, we can get more guarantee of cma_alloc() success by
discarding this case.

Note that there is no change in admin POV since this patchset is just
for internal implementation change in MM subsystem. Just one minor
difference for admin is that the memory stat for CMA area will be
printed in the MOVABLE zone. That's all.

4. Result

Following is the experimental result related to utilization problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before>
CMA area: 0 MB 512 MB
Elapsed-time: 92.4 186.5
pswpin: 82 18647
pswpout: 160 69839

<After>
CMA : 0 MB 512 MB
Elapsed-time: 93.1 93.4
pswpin: 84 46
pswpout: 183 92

akpm: "kernel test robot" reported a 26% improvement in
vm-scalability.throughput:
http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

[1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
[2]: https://lkml.org/lkml/2014/10/15/623
[3]: http://www.spinics.net/lists/linux-mm/msg100562.html

Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3cda233 10-Apr-2018 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request

Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
important to reserve. When ZONE_MOVABLE is used, this problem would
theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
allocation request which is mainly used for page cache and anon page
allocation. So, fix it by setting 0 to
sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM].

And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
makes code complex. For example, if there is highmem system, following
reserve ratio is activated for *NORMAL ZONE* which would be easyily
misleading people.

#ifdef CONFIG_HIGHMEM
32
#endif

This patch also fixes this situation by defining
sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to
right place.

Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tony Lindgren <tony@atomide.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 034ebf65 10-Apr-2018 Roman Gushchin <guro@fb.com>

mm: treat indirectly reclaimable memory as available in MemAvailable

Adjust /proc/meminfo MemAvailable calculation by adding the amount of
indirectly reclaimable memory (rounded to the PAGE_SIZE).

Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c7452a0 05-Apr-2018 Mike Kravetz <mike.kravetz@oracle.com>

mm/page_isolation.c: make start_isolate_page_range() fail if already isolated

start_isolate_page_range() is used to set the migrate type of a set of
pageblocks to MIGRATE_ISOLATE while attempting to start a migration
operation. It assumes that only one thread is calling it for the
specified range. This routine is used by CMA, memory hotplug and
gigantic huge pages. Each of these users synchronize access to the
range within their subsystem. However, two subsystems (CMA and gigantic
huge pages for example) could attempt operations on the same range. If
this happens, one thread may 'undo' the work another thread is doing.
This can result in pageblocks being incorrectly left marked as
MIGRATE_ISOLATE and therefore not available for page allocation.

What is ideally needed is a way to synchronize access to a set of
pageblocks that are undergoing isolation and migration. The only thing
we know about these pageblocks is that they are all in the same zone. A
per-node mutex is too coarse as we want to allow multiple operations on
different ranges within the same zone concurrently. Instead, we will
use the migration type of the pageblocks themselves as a form of
synchronization.

start_isolate_page_range sets the migration type on a set of page-
blocks going in order from the one associated with the smallest pfn to
the largest pfn. The zone lock is acquired to check and set the
migration type. When going through the list of pageblocks check if
MIGRATE_ISOLATE is already set. If so, this indicates another thread is
working on this pageblock. We know exactly which pageblocks we set, so
clean up by undo those and return -EBUSY.

This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of callers making use of
these interfaces. Update comments in alloc_contig_range to reflect this
new functionality.

Each CPU holds the associated zone lock to modify or examine the
migration type of a pageblock. And, it will only examine/update a
single pageblock per lock acquire/release cycle.

Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ecd9d40 05-Apr-2018 David Rientjes <rientjes@google.com>

mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory

Kswapd will not wakeup if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.

This can be true if there is a lot of free memory available. For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction. If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.

When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim. This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97334162 05-Apr-2018 Aaron Lu <aaron.lu@intel.com>

mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache
cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

Normally, the number of prefetch will be pcp->batch(default=31 and has
an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of
pcp's pages get all drained, it will be pcp->count which has an upper
limit of pcp->high. pcp->high, although has a default value of 186
(pcp->batch=31 * 6), can be changed by user through
/proc/sys/vm/percpu_pagelist_fraction and there is no software upper
limit so could be large, like several thousand. For this reason, only
the first pcp->batch number of page's buddy structure is prefetched to
avoid excessive prefetching.

In the meantime, there are two concerns:

1. the prefetch could potentially evict existing cachelines, especially
for L1D cache since it is not huge

2. there is some additional instruction overhead, namely calculating
buddy pfn twice

For 1, it's hard to say, this microbenchmark though shows good result
but the actual benefit of this patch will be workload/CPU dependant;

For 2, since the calculation is a XOR on two local variables, it's
expected in many cases that cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time consuming
part under zone->lock is the wait of 'struct page' cacheline of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
v4.16-rc2+ 9034215 7971818 13667135 15677465
patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%

Note: this patch's performance improvement percent is against patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

[aaron.lu@intel.com: use helper function, avoid disordering pages]
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
[aaron.lu@intel.com: v4]
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a5f4e5b 05-Apr-2018 Aaron Lu <aaron.lu@intel.com>

mm/free_pcppages_bulk: do not hold lock when picking pages to free

When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, the
zone->lock is held and then pages are chosen from PCP's migratetype
list. While there is actually no need to do this 'choose part' under
lock since it's PCP pages, the only CPU that can touch them is us and
irq is also disabled.

Moving this part outside could reduce lock held time and improve
performance. Test with will-it-scale/page_fault1 full load:

kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
v4.16-rc2+ 9034215 7971818 13667135 15677465
this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%

What the test does is: starts $nr_cpu processes and each will repeatedly
do the following for 5 minutes:

- mmap 128M anonymouse space

- write access to that space

- munmap.

The score is the aggregated iteration.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77ba9062 05-Apr-2018 Aaron Lu <aaron.lu@intel.com>

mm/free_pcppages_bulk: update pcp->count inside

Matthew Wilcox found that all callers of free_pcppages_bulk() currently
update pcp->count immediately after so it's natural to do it inside
free_pcppages_bulk().

No functionality or performance change is expected from this patch.

Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f16f91f 05-Apr-2018 David Rientjes <rientjes@google.com>

mm, page_alloc: move mirrored_kernelcore to __meminitdata

mirrored_kernelcore can be in __meminitdata, so move it there.

At the same time, fixup section specifiers to be after the name of the
variable per checkpatch.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5c6d650 05-Apr-2018 David Rientjes <rientjes@google.com>

mm, page_alloc: extend kernelcore and movablecore for percent

Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
the system memory capacity to be known when specifying the command line,
however.

This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory. This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example,
as a proportion of a system's memory rather than a hardcoded byte value.

To define the percentage, the final character of the parameter should be
a '%'.

mhocko: "why is anyone using these options nowadays?"

rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example. When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.

[rientjes@google.com: updates]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0dc12e8 05-Apr-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm/memory_hotplug: optimize memory hotplug

During memory hotplugging we traverse struct pages three times:

1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set do: set_page_node(page, nid); and
SetPageReserved(page);
3. loop in memmap_init_zone() to call __init_single_pfn()

This patch removes the first two loops, and leaves only loop 3. All
struct pages are initialized in one place, the same as it is done during
boot.

The benefits:

- We improve memory hotplug performance because we are not evicting the
cache several times and also reduce loop branching overhead.

- Remove condition from hotpath in __init_single_pfn(), that was added
in order to fix the problem that was reported by Bharata in the above
email thread, thus also improve performance during normal boot.

- Make memory hotplug more similar to the boot memory initialization
path because we zero and initialize struct pages only in one
function.

- Simplifies memory hotplug struct page initialization code, and thus
enables future improvements, such as multi-threading the
initialization of struct pages in order to improve hotplug
performance even further on larger machines.

[pasha.tatashin@oracle.com: v5]
Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9e97a19 05-Apr-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: initialize pages on demand during boot

Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the
rest later on.

It is, however, problematic to know how many pages the kernel needs
during boot. Different modules and kernel parameters may change the
requirement, so the boot cpu either initializes too many pages or runs
out of memory.

To fix that, initialize early pages on demand. This ensures the kernel
does the minimum amount of work to initialize pages during boot and
leaves the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).

The on-demand code is permanently disabled using static branching once
deferred pages are initialized. After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.

Sergey Senozhatsky noticed that while deferred pages currently make
sense only on NUMA machines (we start one thread per latency node),
CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT,
so that is also must be addressed in the patch.

[akpm@linux-foundation.org: fix typo in comment, make deferred_pages static]
[pasha.tatashin@oracle.com: fix min() type mismatch warning]
Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()]
Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: might_sleep warning]
Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()]
[pasha.tatashin@oracle.com: v5]
Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: tweak comments]
[pasha.tatashin@oracle.com: v6]
Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a2d7fa8 05-Apr-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: disable interrupts while initializing deferred pages

Vlastimil Babka reported about a window issue during which when deferred
pages are initialized, and the current version of on-demand
initialization is finished, allocations may fail. While this is highly
unlikely scenario, since this kind of allocation request must be large,
and must come from interrupt handler, we still want to cover it.

We solve this by initializing deferred pages with interrupts disabled,
and holding node_size_lock spin lock while pages in the node are being
initialized. The on-demand deferred page initialization that comes
later will use the same lock, and thus synchronize with
deferred_init_memmap().

It is unlikely for threads that initialize deferred pages to be
interrupted. They run soon after smp_init(), but before modules are
initialized, and long before user space programs. This is why there is
no adverse effect of having these threads running with interrupts
disabled.

[pasha.tatashin@oracle.com: v6]
Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31025351 05-Apr-2018 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE

alloc_contig_range() initiates compaction and eventual migration for the
purpose of either CMA or HugeTLB allocations. At present, the reason
code remains the same MR_CMA for either of these cases. Let's make it
MR_CONTIG_RANGE which will appropriately reflect the reason code in both
these cases.

Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f59f1caf 22-Mar-2018 Daniel Vacek <neelx@redhat.com>

Revert "mm: page_alloc: skip over regions of invalid pfns where possible"

This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of
invalid pfns where possible"). The commit is meant to be a boot init
speed up skipping the loop in memmap_init_zone() for invalid pfns.

But given some specific memory mapping on x86_64 (or more generally
theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
implementation also skips valid pfns which is plain wrong and causes
'kernel BUG at mm/page_alloc.c:1389!'

crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
kernel BUG at mm/page_alloc.c:1389!
invalid opcode: 0000 [#1] SMP
--
RIP: 0010: move_freepages+0x15e/0x160
--
Call Trace:
move_freepages_block+0x73/0x80
__rmqueue+0x263/0x460
get_page_from_freelist+0x7e1/0x9e0
__alloc_pages_nodemask+0x176/0x420
--

crash> page_init_bug -v | grep RAM
<struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB)
<struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
<struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
<struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB)

crash> page_init_bug | head -6
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
<struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
<struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095
<struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
BUG, zones differ!

crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0001e00000 78000000 0 0 0 0
ffffea0001ed7fc0 7b5ff000 0 0 0 0
ffffea0001ed8000 7b600000 0 0 0 0 <<<<
ffffea0001ede1c0 7b787000 0 0 0 0
ffffea0001ede200 7b788000 0 0 1 1fffff00000000

Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2e517d68 22-Mar-2018 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

lockdep: fix fs_reclaim warning

Dave Jones reported fs_reclaim lockdep warnings.

============================================
WARNING: possible recursive locking detected
4.15.0-rc9-backup-debug+ #1 Not tainted
--------------------------------------------
sshd/24800 is trying to acquire lock:
(fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

but task is already holding lock:
(fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(fs_reclaim);
lock(fs_reclaim);

*** DEADLOCK ***

May be due to missing lock nesting notation

2 locks held by sshd/24800:
#0: (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
#1: (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

stack backtrace:
CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
Call Trace:
dump_stack+0xbc/0x13f
__lock_acquire+0xa09/0x2040
lock_acquire+0x12e/0x350
fs_reclaim_acquire.part.102+0x29/0x30
kmem_cache_alloc+0x3d/0x2c0
alloc_extent_state+0xa7/0x410
__clear_extent_bit+0x3ea/0x570
try_release_extent_mapping+0x21a/0x260
__btrfs_releasepage+0xb0/0x1c0
btrfs_releasepage+0x161/0x170
try_to_release_page+0x162/0x1c0
shrink_page_list+0x1d5a/0x2fb0
shrink_inactive_list+0x451/0x940
shrink_node_memcg.constprop.88+0x4c9/0x5e0
shrink_node+0x12d/0x260
try_to_free_pages+0x418/0xaf0
__alloc_pages_slowpath+0x976/0x1790
__alloc_pages_nodemask+0x52c/0x5c0
new_slab+0x374/0x3f0
___slab_alloc.constprop.81+0x47e/0x5a0
__slab_alloc.constprop.80+0x32/0x60
__kmalloc_track_caller+0x267/0x310
__kmalloc_reserve.isra.40+0x29/0x80
__alloc_skb+0xee/0x390
sk_stream_alloc_skb+0xb8/0x340
tcp_sendmsg_locked+0x8e6/0x1d30
tcp_sendmsg+0x27/0x40
inet_sendmsg+0xd0/0x310
sock_write_iter+0x17a/0x240
__vfs_write+0x2ab/0x380
vfs_write+0xfb/0x260
SyS_write+0xb6/0x140
do_syscall_64+0x1e5/0xc05
entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37e ("locking/lockdep:
Rework FS_RECLAIM annotation") which replaced the use of
lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
and lockdep_trace_alloc() in slab_pre_alloc_hook() with
fs_reclaim_acquire()/ fs_reclaim_release().

Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
__GFP_NOWARN to gfp_mask, and all reclaim path simply propagates
__GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is
trying to grab the 'fake' lock again when __perform_reclaim() already
grabbed the 'fake' lock.

The

/* this guy won't enter reclaim */
if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread
won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
341ce06f69ab ("page allocator: calculate the alloc_flags for allocation
only once") added the PF_MEMALLOC safeguard (

/* Avoid recursion of direct reclaim */
if (p->flags & PF_MEMALLOC)
goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and
allow __need_fs_reclaim() to return false.

Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79375ea3 09-Mar-2018 Arnd Bergmann <arnd@arndb.de>

mm: remove obsolete alloc_remap()

Tile was the only remaining architecture to implement alloc_remap(),
and since that is being removed, there is no point in keeping this
function.

Removing all callers simplifies the mem_map handling.

Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 3e04040d 14-Mar-2018 Ard Biesheuvel <ardb@kernel.org>

Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"

This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
alignment") modified the logic in memmap_init_zone() to initialize
struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
in move_freepages(), which is redundant by its own admission, and
dereferences struct page fields to obtain the zone without checking
whether the struct pages in question are valid to begin with.

Commit 864b75f9d6b0 only makes it worse, since the rounding it does
may cause pfn assume the same value it had in a prior iteration of
the loop, resulting in an infinite loop and a hang very early in the
boot. Also, since it doesn't perform the same rounding on start_pfn
itself but only on intermediate values following an invalid PFN, we
may still hit the same VM_BUG_ON() as before.

So instead, let's fix this at the core, and ensure that the BUG
check doesn't dereference struct page fields of invalid pages.

Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Tested-by: Jan Glauber <jglauber@cavium.com>
Tested-by: Shanker Donthineni <shankerd@codeaurora.org>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 864b75f9 09-Mar-2018 Daniel Vacek <neelx@redhat.com>

mm/page_alloc: fix memmap_init_zone pageblock alignment

Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") introduced a bug where move_freepages() triggers a
VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
To fix this, simply align the skipped pfns in memmap_init_zone() the
same way as in move_freepages_block().

Seen in one of the RHEL reports:

crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
kernel BUG at mm/page_alloc.c:1389!
invalid opcode: 0000 [#1] SMP
--
RIP: 0010:[<ffffffff8118833e>] [<ffffffff8118833e>] move_freepages+0x15e/0x160
RSP: 0018:ffff88054d727688 EFLAGS: 00010087
--
Call Trace:
[<ffffffff811883b3>] move_freepages_block+0x73/0x80
[<ffffffff81189e63>] __rmqueue+0x263/0x460
[<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
[<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
--
RIP [<ffffffff8118833e>] move_freepages+0x15e/0x160
RSP <ffff88054d727688>

crash> page_init_bug -v | grep RAM
<struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB)
<struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
<struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
<struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB)

crash> page_init_bug | head -6
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
<struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
<struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095
<struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
BUG, zones differ!

Note that this range follows two not populated sections
68000000-77ffffff in this zone. 7b788000-7b7fffff is the first one
after a gap. This makes memmap_init_zone() skip all the pfns up to the
beginning of this range. But this range is not pageblock (2M) aligned.
In fact no range has to be.

crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0001e00000 78000000 0 0 0 0
ffffea0001ed7fc0 7b5ff000 0 0 0 0
ffffea0001ed8000 7b600000 0 0 0 0 <<<<
ffffea0001ede1c0 7b787000 0 0 0 0
ffffea0001ede200 7b788000 0 0 1 1fffff00000000

Top part of page flags should contain nodeid and zonenr, which is not
the case for page ffffea0001ed8000 here (<<<<).

crash> log | grep -o fffea0001ed[^\ ]* | sort -u
fffea0001ed8000
fffea0001eded20
fffea0001edffc0

crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
fffea0001ed8000
fffea0001eded00
fffea0001eded20
fffea0001edffc0

Initialization of the whole beginning of the section is skipped up to
the start of the range due to the commit b92df1de5d28. Now any code
calling move_freepages_block() (like reusing the page from a freelist as
in this example) with a page from the beginning of the range will get
the page rounded down to start_page ffffea0001ed8000 and passed to
move_freepages() which crashes on assertion getting wrong zonenr.

> VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

Note, page_zone() derives the zone from page flags here.

From similar machine before commit b92df1de5d28:

crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
fffff73941e00000 78000000 0 0 1 1fffff00000000
fffff73941ed7fc0 7b5ff000 0 0 1 1fffff00000000
fffff73941ed8000 7b600000 0 0 1 1fffff00000000
fffff73941edff80 7b7fe000 0 0 1 1fffff00000000
fffff73941edffc0 7b7ff000 ffff8e67e04d3ae0 ad84 1 1fffff00020068 uptodate,lru,active,mappedtodisk

All the pages since the beginning of the section are initialized.
move_freepages()' not gonna blow up.

The same machine with this fix applied:

crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0001e00000 78000000 0 0 0 0
ffffea0001e00000 7b5ff000 0 0 0 0
ffffea0001ed8000 7b600000 0 0 1 1fffff00000000
ffffea0001edff80 7b7fe000 0 0 1 1fffff00000000
ffffea0001edffc0 7b7ff000 ffff88017fb13720 8 2 1fffff00020068 uptodate,lru,active,mappedtodisk

At least the bare minimum of pages is initialized preventing the crash
as well.

Customers started to report this as soon as 7.4 (where b92df1de5d28 was
merged in RHEL) was released. I remember reports from
September/October-ish times. It's not easily reproduced and happens on
a handful of machines only. I guess that's why. But that does not make
it less serious, I think.

Though there actually is a report here:
https://bugzilla.kernel.org/show_bug.cgi?id=196443

And there are reports for Fedora from July:
https://bugzilla.redhat.com/show_bug.cgi?id=1473242
and CentOS:
https://bugs.centos.org/view.php?id=13964
and we internally track several dozens reports for RHEL bug
https://bugzilla.redhat.com/show_bug.cgi?id=1525121

Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 895f7b8e 21-Feb-2018 Juergen Gross <jgross@suse.com>

mm: don't defer struct page initialization for Xen pv guests

Commit f7f99100d8d9 ("mm: stop zeroing memory during allocation in
vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
information in struct page of early page tables could get lost.

This will lead to the kernel trying to write directly into the page
tables instead of asking the hypervisor to do so. The result is a crash
like the following:

BUG: unable to handle kernel paging request at ffff8801ead19008
IP: xen_set_pud+0x4e/0xd0
PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
Oops: 0003 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
task: ffffffff81c10480 task.stack: ffffffff81c00000
RIP: e030:xen_set_pud+0x4e/0xd0
Call Trace:
__pmd_alloc+0x128/0x140
ioremap_page_range+0x3f4/0x410
__ioremap_caller+0x1c3/0x2e0
acpi_os_map_iomem+0x175/0x1b0
acpi_tb_acquire_table+0x39/0x66
acpi_tb_validate_table+0x44/0x7c
acpi_tb_verify_temp_table+0x45/0x304
acpi_reallocate_root_table+0x12d/0x141
acpi_early_init+0x4d/0x10a
start_kernel+0x3eb/0x4a1
xen_start_kernel+0x528/0x532
Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d <4c> 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
CR2: ffff8801ead19008
---[ end trace 38eca2e56f1b642e ]---

Avoid this problem by not deferring struct page initialization when
running as Xen pv guest.

Pavel said:

: This is unique for Xen, so this particular issue won't effect other
: configurations. I am going to investigate if there is a way to
: re-enable deferred page initialization on xen guests.

[akpm@linux-foundation.org: explicitly include xen.h]
Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
Fixes: f7f99100d8d95d ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: <stable@vger.kernel.org> [4.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9bb5a391 31-Jan-2018 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: fix memmap initialization

Bharata has noticed that onlining a newly added memory doesn't increase
the total memory, pointing to commit f7f99100d8d9 ("mm: stop zeroing
memory during allocation in vmemmap") as a culprit. This commit has
changed the way how the memory for memmaps is initialized and moves it
from the allocation time to the initialization time. This works
properly for the early memmap init path.

It doesn't work for the memory hotplug though because we need to mark
page as reserved when the sparsemem section is created and later
initialize it completely during onlining. memmap_init_zone is called in
the early stage of onlining. With the current code it calls
__init_single_page and as such it clears up the whole stage and
therefore online_pages_range skips those pages.

Fix this by skipping mm_zero_struct_page in __init_single_page for
memory hotplug path. This is quite uggly but unifying both early init
and memory hotplug init paths is a large project. Make sure we plug the
regression at least.

Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c2c6488 31-Jan-2018 Shile Zhang <zhangshile@gmail.com>

mm/page_alloc.c: fix typos in comments

Link: http://lkml.kernel.org/r/1515485774-4768-1-git-send-email-zhangshile@gmail.com
Signed-off-by: Shile Zhang <zhangshile@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48128397 31-Jan-2018 Jiankang Chen <chenjiankang1@huawei.com>

mm/page_alloc.c: fix comment in __get_free_pages()

__get_free_pages() will return a virtual address, but it is not just a
32-bit address, for example on a 64-bit system. And this comment really
confuses new readers of mm.

Link: http://lkml.kernel.org/r/1511780964-64864-1-git-send-email-chenjiankang1@huawei.com
Signed-off-by: Jiankang Chen <chenjiankang1@huawei.com>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80b1f41c 31-Jan-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: split deferred_init_range into initializing and freeing parts

In deferred_init_range() we initialize struct pages, and also free them
to buddy allocator. We do it in separate loops, because buddy page is
computed ahead, so we do not want to access a struct page that has not
been initialized yet.

There is still, however, a corner case where it is potentially possible
to access uninitialized struct page: this is when buddy page is from the
next memblock range.

This patch fixes this problem by splitting deferred_init_range() into
two functions: one to initialize struct pages, and another to free them.

In addition, this patch brings the following improvements:
- Get rid of __def_free() helper function. And simplifies loop logic by
adding a new pfn validity check function: deferred_pfn_valid().
- Reduces number of variables that we track. So, there is a higher
chance that we will avoid using stack to store/load variables inside
hot loops.
- Enables future multi-threading of these functions: do initialization
in multiple threads, wait for all threads to finish, do freeing part
in multithreading.

Tested on x86 with 1T of memory to make sure no regressions are
introduced.

[akpm@linux-foundation.org: fix spello in comment]
Link: http://lkml.kernel.org/r/20171107150446.32055-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a99583e7 29-Dec-2017 Christoph Hellwig <hch@lst.de>

mm: pass the vmem_altmap to memmap_init_zone

Pass the vmem_altmap two levels down instead of needing a lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# e8c24773 04-Jan-2018 Dave Young <dyoung@redhat.com>

mm: check pfn_valid first in zero_resv_unavail

With latest kernel I get below bug while testing kdump:

BUG: unable to handle kernel paging request at ffffea00034b1040
IP: zero_resv_unavail+0xbd/0x126
PGD 37b98067 P4D 37b98067 PUD 37b97067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.15.0-rc1+ #316
Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017
task: ffffffff81a0e4c0 task.stack: ffffffff81a00000
RIP: 0010:zero_resv_unavail+0xbd/0x126
RSP: 0000:ffffffff81a03d88 EFLAGS: 00010006
RAX: 0000000000000000 RBX: ffffea00034b1040 RCX: 0000000000000010
RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffea00034b1040
RBP: 00000000000d2c41 R08: 00000000000000c0 R09: 0000000000000a0d
R10: 0000000000000002 R11: 0000000000007f01 R12: ffffffff81a03d90
R13: ffffea0000000000 R14: 0000000000000063 R15: 0000000000000062
FS: 0000000000000000(0000) GS:ffffffff81c73000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffea00034b1040 CR3: 0000000037609000 CR4: 00000000000606b0
Call Trace:
? free_area_init_nodes+0x640/0x664
? zone_sizes_init+0x58/0x72
? setup_arch+0xb50/0xc6c
? start_kernel+0x64/0x43d
? secondary_startup_64+0xa5/0xb0
Code: c1 e8 0c 48 39 d8 76 27 48 89 de 48 c1 e3 06 48 c7 c7 7a 87 79 81 e8 b0 c0 3e ff 4c 01 eb b9 10 00 00 00 31 c0 48 89 df 49 ff c6 <f3> ab eb bc 6a 00 49 c7 c0 f0 93 d1 81 31 d2 83 ce ff 41 54 49
RIP: zero_resv_unavail+0xbd/0x126 RSP: ffffffff81a03d88
CR2: ffffea00034b1040
---[ end trace f5ba9e8f73c7ee26 ]---

This is introduced by commit a4a3ede2132a ("mm: zero reserved and
unavailable struct pages").

The reason is some efi reserved boot ranges is not reported in E820 ram.
In my case it is a bgrt buffer:

efi: mem00: [Boot Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x00000000d2c41000-0x00000000d2c85fff] (0MB)

Use "add_efi_memmap" can workaround the problem with another fix:

http://lkml.kernel.org/r/20171130052327.GA3500@dhcp-128-65.nay.redhat.com

In zero_resv_unavail it would be better to check pfn_valid first before
zero the page struct. This fixes the problem and potential other
similar problems. Also as Pavel Tatashin suggested checks pfn_valid at
the beginning of the section.

The range is backed by real memory. The memory range is efi "Boot
Service Data", that means after ExitBootServices() these ranges can be
used as system ram. But some of them need to be reserved, for example
the bgrt image address in an acpi table, if the image memory is freed
then kexec reboot will fail because kexec inherit same acpi table to
initialize the driver.

Link: http://lkml.kernel.org/r/20171201095048.GA3084@dhcp-128-65.nay.redhat.com
Fixes: a4a3ede2132a ("mm: zero reserved and unavailable struct pages")
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c24ad77d 14-Dec-2017 Lucas Stach <l.stach@pengutronix.de>

mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()

Since commit 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once
when freeing a list of pages") we see excessive IRQ disabled times of up
to 25ms on an embedded ARM system (tracing overhead included).

This is due to graphics buffers being freed back to the system via
release_pages(). Graphics buffers can be huge, so it's not hard to hit
cases where the list of pages to free has 2048 entries. Disabling IRQs
while freeing all those pages is clearly not a good idea.

Introduce a batch limit, which allows IRQ servicing once every few
pages. The batch count is the same as used in other parts of the MM
subsystem when dealing with IRQ disabled regions.

Link: http://lkml.kernel.org/r/20171207170314.4419-1-l.stach@pengutronix.de
Fixes: 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once when freeing a list of pages")
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 63cd4489 29-Nov-2017 Mike Kravetz <mike.kravetz@oracle.com>

mm/cma: fix alloc_contig_range ret code/potential leak

If the call __alloc_contig_migrate_range() in alloc_contig_range returns
-EBUSY, processing continues so that test_pages_isolated() is called
where there is a tracepoint to identify the busy pages. However, it is
possible for busy pages to become available between the calls to these
two routines. In this case, the range of pages may be allocated.
Unfortunately, the original return code (ret == -EBUSY) is still set and
returned to the caller. Therefore, the caller believes the pages were
not allocated and they are leaked.

Update the comment to indicate that allocation is still possible even if
__alloc_contig_migrate_range returns -EBUSY. Also, clear return code in
this case so that it is not accidentally used or returned to caller.

Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b81cb2f 29-Nov-2017 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: do not back off draining pcp free pages from kworker context

drain_all_pages backs off when called from a kworker context since
commit 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue
context") because the original IPI based pcp draining has been replaced
by a WQ based one and the check wanted to prevent from recursion and
inter workers dependencies. This has made some sense at the time
because the system WQ has been used and one worker holding the lock
could be blocked while waiting for new workers to emerge which can be a
problem under OOM conditions.

Since then commit ce612879ddc7 ("mm: move pcp and lru-pcp draining into
single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a
rescuer so we shouldn't depend on any other WQ activity to make a
forward progress so calling drain_all_pages from a worker context is
safe as long as this doesn't happen from mm_percpu_wq itself which is
not the case because all workers are required to _not_ depend on any MM
locks.

Why is this a problem in the first place? ACPI driven memory hot-remove
(acpi_device_hotplug) is executed from the worker context. We end up
calling __offline_pages to free all the pages and that requires both
lru_add_drain_all_cpuslocked and drain_all_pages to do their job
otherwise we can have dangling pages on pcp lists and fail the offline
operation (__test_page_isolated_in_pageblock would see a page with 0 ref
count but without PageBuddy set).

Fix the issue by removing the worker check in drain_all_pages.
lru_add_drain_all_cpuslocked doesn't have this restriction so it works
as expected.

Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org
Fixes: 0ccce3b924212 ("mm, page_alloc: drain per-cpu pages from workqueue context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2583d671 17-Nov-2017 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: split off flag for not updating skip hints

Pageblock skip hints were added as a heuristic for compaction, which
shares core code with CMA. Since CMA reliability would suffer from the
heuristics, compact_control flag ignore_skip_hint was added for the CMA
use case. Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint
in update_pageblock_skip") the flag also means that CMA won't *update*
the skip hints in addition to ignoring them.

Today, direct compaction can also ignore the skip hints in the last
resort attempt, but there's no reason not to set them when isolation
fails in such case. Thus, this patch splits off a new no_set_skip_hint
flag to avoid the updating, which only CMA sets. This should improve
the heuristics a bit, and allow us to simplify the persistent skip bit
handling as the next step.

Link: http://lkml.kernel.org/r/20171102121706.21504-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0cd842f9 15-Nov-2017 Oscar Salvador <osalvador@techadventures.net>

mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP

free_area_init_node() calls alloc_node_mem_map(), but this function does
nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.

As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
alloc_node_mem_map() out of the function, and define a
alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present.

This also moves the printk that lays within the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
alloc_node_mem_map(), getting rid of the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().

[akpm@linux-foundation.org: clean up the printk while we're there]
Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0205f755 15-Nov-2017 Michal Hocko <mhocko@suse.com>

mm: simplify nodemask printing

alloc_warn() and dump_header() have to explicitly handle NULL nodemask
which forces both paths to use pr_cont. We can do better. printk
already handles NULL pointers properly so all we need is to teach
nodemask_pr_args to handle NULL nodemask carefully. This allows
simplification of both alloc_warn() and dump_header() and gets rid of
pr_cont altogether.

This patch has been motivated by patch from Joe Perches

http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com

[akpm@linux-foundation.org: fix tile warning, per Arnd]
Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joe Perches <joe@perches.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d135e575 15-Nov-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm/page_alloc.c: broken deferred calculation

In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.

The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.

The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.

Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 400e2249 15-Nov-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm: don't warn about allocations which stall for too long

Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
long") was a great step for reducing possibility of silent hang up
problem caused by memory allocation stalls. But this commit reverts it,
for it is possible to trigger OOM lockup and/or soft lockups when many
threads concurrently called warn_alloc() (in order to warn about memory
allocation stalls) due to current implementation of printk(), and it is
difficult to obtain useful information due to limitation of synchronous
warning approach.

Current printk() implementation flushes all pending logs using the
context of a thread which called console_unlock(). printk() should be
able to flush all pending logs eventually unless somebody continues
appending to printk() buffer.

Since warn_alloc() started appending to printk() buffer while waiting
for oom_kill_process() to make forward progress when oom_kill_process()
is processing pending logs, it became possible for warn_alloc() to force
oom_kill_process() loop inside printk(). As a result, warn_alloc()
significantly increased possibility of preventing oom_kill_process()
from making forward progress.

---------- Pseudo code start ----------
Before warn_alloc() was introduced:

retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
}
goto retry;

After warn_alloc() was introduced:

retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
} else if (waited_for_10seconds()) {
atomic_inc(&printk_pending_logs);
}
goto retry;
---------- Pseudo code end ----------

Although waited_for_10seconds() becomes true once per 10 seconds,
unbounded number of threads can call waited_for_10seconds() at the same
time. Also, since threads doing waited_for_10seconds() keep doing
almost busy loop, the thread doing print_one_log() can use little CPU
resource. Therefore, this situation can be simplified like

---------- Pseudo code start ----------
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
} else {
atomic_inc(&printk_pending_logs);
}
goto retry;
---------- Pseudo code end ----------

when printk() is called faster than print_one_log() can process a log.

One of possible mitigation would be to introduce a new lock in order to
make sure that no other series of printk() (either oom_kill_process() or
warn_alloc()) can append to printk() buffer when one series of printk()
(either oom_kill_process() or warn_alloc()) is already in progress.

Such serialization will also help obtaining kernel messages in readable
form.

---------- Pseudo code start ----------
retry:
if (mutex_trylock(&oom_lock)) {
mutex_lock(&oom_printk_lock);
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_printk_lock);
mutex_unlock(&oom_lock)
} else {
if (mutex_trylock(&oom_printk_lock)) {
atomic_inc(&printk_pending_logs);
mutex_unlock(&oom_printk_lock);
}
}
goto retry;
---------- Pseudo code end ----------

But this commit does not go that direction, for we don't want to
introduce a new lock dependency, and we unlikely be able to obtain
useful information even if we serialized oom_kill_process() and
warn_alloc().

Synchronous approach is prone to unexpected results (e.g. too late [1],
too frequent [2], overlooked [3]). As far as I know, warn_alloc() never
helped with providing information other than "something is going wrong".
I want to consider asynchronous approach which can obtain information
during stalls with possibly relevant threads (e.g. the owner of
oom_lock and kswapd-like threads) and serve as a trigger for actions
(e.g. turn on/off tracepoints, ask libvirt daemon to take a memory dump
of stalling KVM guest for diagnostic purpose).

This commit temporarily loses ability to report e.g. OOM lockup due to
unable to invoke the OOM killer due to !__GFP_FS allocation request.
But asynchronous approach will be able to detect such situation and emit
warning. Thus, let's remove warn_alloc().

[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))

Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b050e376 15-Nov-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fix potential false positive in __zone_watermark_ok

Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), __zone_watermark_ok() check for high-order
allocations will shortcut per-migratetype free list checks for
ALLOC_HARDER allocations, and return true as long as there's free page
of any migratetype. The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.

The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct. There also
shouldn't be a performance tradeoff here.

Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4518085e 15-Nov-2017 Kemi Wang <kemi.wang@intel.com>

mm, sysctl: make NUMA stats configurable

This is the second step which introduces a tunable interface that allow
numa stats configurable for optimizing zone_statistics(), as suggested
by Dave Hansen and Ying Huang.

=========================================================================

When page allocation performance becomes a bottleneck and you can
tolerate some possible tool breakage and decreased numa counter
precision, you can do:

echo 0 > /proc/sys/vm/numa_stat

In this case, numa counter update is ignored. We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and
reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
drop of cpu cycles per single page allocation and reclaim on Jesper's
page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
(88 threads, 126G memory).

Benchmark link provided by Jesper D Brouer (increase loop times to
10000000):

https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

=========================================================================

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:

echo 1 > /proc/sys/vm/numa_stat

This is system default setting.

Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
for comments to help improve the original patch.

[keescook@chromium.org: make sure mutex is a global static]
Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0fac3ba5 15-Nov-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: simplify list handling in rmqueue_bulk()

rmqueue_bulk() fills an empty pcplist with pages from the free list. It
tries to preserve increasing order by pfn to the caller, because it
leads to better performance with some I/O controllers, as explained in
commit e084b2d95e48 ("page-allocator: preserve PFN ordering when
__GFP_COLD is set").

To preserve the order, it's sufficient to add pages to the tail of the
list as they are retrieved. The current code instead adds to the head
of the list, but then updates the list head pointer to the last added
page, in each step. This does result in the same order, but is
needlessly confusing and potentially wasteful, with no apparent benefit.
This patch simplifies the code and adjusts comment accordingly.

Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 453f85d4 15-Nov-2017 Mel Gorman <mgorman@techsingularity.net>

mm: remove __GFP_COLD

As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of. Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact. Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache. Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop. It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path. A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d4894b5 15-Nov-2017 Mel Gorman <mgorman@techsingularity.net>

mm: remove cold parameter from free_hot_cold_page*

Most callers users of free_hot_cold_page claim the pages being released
are cache hot. The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost. As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages. It should also be less confusing as there are subtle differences
between them. __free_pages drops a reference and frees a page when the
refcount reaches zero. free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name. free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[mgorman@techsingularity.net: add pages to head, not tail]
Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9cca35d4 15-Nov-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: enable/disable IRQs once when freeing a list of pages

Patch series "Follow-up for speed up page cache truncation", v2.

This series is a follow-on for Jan Kara's series "Speed up page cache
truncation" series. We both ended up looking at the same problem but
saw different problems based on the same data. This series builds upon
his work.

A variety of workloads were compared on four separate machines but each
machine showed gains albeit at different levels. Minimally, some of the
differences are due to NUMA where truncating data from a remote node is
slower than a local node. The workloads checked were

o sparse truncate microbenchmark, tiny
o sparse truncate microbenchmark, large
o reaim-io disk workfile
o dbench4 (modified by mmtests to produce more stable results)
o filebench varmail configuration for small memory size
o bonnie, directory operations, working set size 2*RAM

reaim-io, dbench and filebench all showed minor gains. Truncation does
not dominate those workloads but were tested to ensure no other
regressions. They will not be reported further.

The sparse truncate microbench was written by Jan. It creates a number
of files and then times how long it takes to truncate each one. The
"tiny" configuraiton creates a number of files that easily fits in
memory and times how long it takes to truncate files with page cache.
The large configuration uses enough files to have data that is twice the
size of memory and so timings there include truncating page cache and
working set shadow entries in the radix tree.

Patches 1-4 are the most relevant parts of this series. Patches 5-8 are
optional as they are deleting code that is essentially useless but has a
negligible performance impact.

The changelogs have more information on performance but just for bonnie
delete options, the main comparison is

bonnie
4.14.0-rc5 4.14.0-rc5 4.14.0-rc5
jan-v2 vanilla mel-v2
Hmean SeqCreate ops 76.20 ( 0.00%) 75.80 ( -0.53%) 76.80 ( 0.79%)
Hmean SeqCreate read 85.00 ( 0.00%) 85.00 ( 0.00%) 85.00 ( 0.00%)
Hmean SeqCreate del 13752.31 ( 0.00%) 12090.23 ( -12.09%) 15304.84 ( 11.29%)
Hmean RandCreate ops 76.00 ( 0.00%) 75.60 ( -0.53%) 77.00 ( 1.32%)
Hmean RandCreate read 96.80 ( 0.00%) 96.80 ( 0.00%) 97.00 ( 0.21%)
Hmean RandCreate del 13233.75 ( 0.00%) 11525.35 ( -12.91%) 14446.61 ( 9.16%)

Jan's series is the baseline and the vanilla kernel is 12% slower where
as this series on top gains another 11%. This is from a different
machine than the data in the changelogs but the detailed data was not
collected as there was no substantial change in v2.

This patch (of 8):

Freeing a list of pages current enables/disables IRQs for each page
freed. This patch splits freeing a list of pages into two operations --
preparing the pages for freeing and the actual freeing. This is a
tradeoff - we're taking two passes of the list to free in exchange for
avoiding multiple enable/disable of IRQs.

sparsetruncate (tiny)
4.14.0-rc4 4.14.0-rc4
janbatch-v1r1 oneirq-v1r1
Min Time 149.00 ( 0.00%) 141.00 ( 5.37%)
1st-qrtle Time 150.00 ( 0.00%) 142.00 ( 5.33%)
2nd-qrtle Time 151.00 ( 0.00%) 142.00 ( 5.96%)
3rd-qrtle Time 151.00 ( 0.00%) 143.00 ( 5.30%)
Max-90% Time 153.00 ( 0.00%) 144.00 ( 5.88%)
Max-95% Time 155.00 ( 0.00%) 147.00 ( 5.16%)
Max-99% Time 201.00 ( 0.00%) 195.00 ( 2.99%)
Max Time 236.00 ( 0.00%) 230.00 ( 2.54%)
Amean Time 152.65 ( 0.00%) 144.37 ( 5.43%)
Stddev Time 9.78 ( 0.00%) 10.44 ( -6.72%)
Coeff Time 6.41 ( 0.00%) 7.23 ( -12.84%)
Best99%Amean Time 152.07 ( 0.00%) 143.72 ( 5.50%)
Best95%Amean Time 150.75 ( 0.00%) 142.37 ( 5.56%)
Best90%Amean Time 150.59 ( 0.00%) 142.19 ( 5.58%)
Best75%Amean Time 150.36 ( 0.00%) 141.92 ( 5.61%)
Best50%Amean Time 150.04 ( 0.00%) 141.69 ( 5.56%)
Best25%Amean Time 149.85 ( 0.00%) 141.38 ( 5.65%)

With a tiny number of files, each file truncated has resident page cache
and it shows that time to truncate is roughtly 5-6% with some minor
jitter.

4.14.0-rc4 4.14.0-rc4
janbatch-v1r1 oneirq-v1r1
Hmean SeqCreate ops 65.27 ( 0.00%) 81.86 ( 25.43%)
Hmean SeqCreate read 39.48 ( 0.00%) 47.44 ( 20.16%)
Hmean SeqCreate del 24963.95 ( 0.00%) 26319.99 ( 5.43%)
Hmean RandCreate ops 65.47 ( 0.00%) 82.01 ( 25.26%)
Hmean RandCreate read 42.04 ( 0.00%) 51.75 ( 23.09%)
Hmean RandCreate del 23377.66 ( 0.00%) 23764.79 ( 1.66%)

As expected, there is a small gain for the delete operation.

[mgorman@techsingularity.net: use page_private and set_page_private helpers]
Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 85ccc8fa 15-Nov-2017 Aaron Lu <aaron.lu@intel.com>

mm/page_alloc: make sure __rmqueue() etc are always inline

__rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
__rmqueue_cma_fallback() are all in page allocator's hot path and better
be finished as soon as possible. One way to make them faster is by making
them inline. But as Andrew Morton and Andi Kleen pointed out:

https://lkml.org/lkml/2017/10/10/1252
https://lkml.org/lkml/2017/10/10/1279

To make sure they are inlined, we should use __always_inline for them.

With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
processes to stress buddy, the results for will-it-scale.processes with
and without the patch are:

On a 2-sockets Intel-Skylake machine:

compiler base head
gcc-4.4.7 6496131 6911823 +6.4%
gcc-4.9.4 7225110 7731072 +7.0%
gcc-5.4.1 7054224 7688146 +9.0%
gcc-6.2.0 7059794 7651675 +8.4%

On a 4-sockets Intel-Skylake machine:

compiler base head
gcc-4.4.7 13162890 13508193 +2.6%
gcc-4.9.4 14997463 15484353 +3.2%
gcc-5.4.1 14708711 15449805 +5.0%
gcc-6.2.0 14574099 15349204 +5.3%

The above 4 compilers are used because I've done the tests through
Intel's Linux Kernel Performance(LKP) infrastructure and they are the
available compilers there.

The benefit being less on 4 sockets machine is due to the lock
contention there(perf-profile/native_queued_spin_lock_slowpath=81%) is
less severe than on the 2 sockets machine(85%).

What the benchmark does is: it forks nr_cpu processes and then each
process does the following:
1 mmap() 128M anonymous space;
2 writes to each page there to trigger actual page allocation;
3 munmap() it.
in a loop.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Binary size wise, I have locally built them with different compilers:

[aaron@aaronlu obj]$ size */*/mm/page_alloc.o
text data bss dec hex filename
37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o
38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o
37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o
38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o
37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o
38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o
36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o
37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o

Text size increased about 800 bytes for mm/page_alloc.o.

[aaron@aaronlu obj]$ size */*/vmlinux
text data bss dec hex filename
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux

Text size for vmlinux has no change though, probably due to function
alignment.

Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7f99100 15-Nov-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: stop zeroing memory during allocation in vmemmap

vmemmap_alloc_block() will no longer zero the block, so zero memory at
its call sites for everything except struct pages. Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version.
So, we will get the performance improvement by zeroing the memory in
parallel when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

BASE FIX
sparse_init 11.244671836s 0.007199623s
zone_sizes_init 4.879775891s 8.355182299s
--------------------------
Total 16.124447727s 8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

[akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4a3ede2 15-Nov-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: zero reserved and unavailable struct pages

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().

In some cases these struct pages are accessed even if they do not
contain any data. One example is page_to_pfn() might access page->flags
if this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the
exiting memory from pfn 1 (i.e. KVM).

Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

===

Here is more detailed example of problem that this patch is addressing:

Run tested on qemu with the following arguments:

-enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.

e820__memblock_setup() reports linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, lets check what we have in zone 0: [1, 131039]

pfn 0, is not part of the zone, but pfns [1, 158], are.

However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug. Because, that path operates at 2M
boundaries (section_nr). And checks if 2M range of pages is hot
removable. It starts with first pfn from zone, rounds it down to 2M
boundary (sturct pages are allocated at 2M boundaries when vmemmap is
created), and checks if that section is hot removable. In this case
start with pfn 1 and convert it down to pfn 0. Later pfn is converted
to struct page, and some fields are checked. Now, if we do not zero
struct pages, we get unpredictable results.

In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
vmemmap memory to ones, the following panic is observed with kernel test
without this patch applied:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT
...
task: ffff88001f4e2900 task.stack: ffffc90000314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
Call Trace:
? is_mem_section_removable+0x5a/0xd0
show_mem_removable+0x6b/0xa0
dev_attr_show+0x1b/0x50
sysfs_kf_seq_show+0xa1/0x100
kernfs_seq_show+0x22/0x30
seq_read+0x1ac/0x3a0
kernfs_fop_read+0x36/0x190
? security_file_permission+0x90/0xb0
__vfs_read+0x16/0x30
vfs_read+0x81/0x130
SyS_read+0x44/0xa0
entry_SYSCALL_64_fastpath+0x1f/0xbd

Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea1f5f37 15-Nov-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: define memblock_virt_alloc_try_nid_raw

* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
specify that memory that was allocated for system hash needs to be
zeroed, otherwise the memory does not need to be zeroed, and client will
initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot
performance.

* debug for raw alloctor

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places excpect zeroed memory.

Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f47a91f 15-Nov-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: deferred_init_memmap improvements

Patch series "complete deferred page initialization", v12.

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
option, which defers initializing struct pages until all cpus have been
started so it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been
zeroed, and the zeroing is done early in boot with a single thread only.
Also, we access that memory and set flags before struct pages are
initialized. All of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==========================================================================
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME SPEED UP
base no deferred: 95.796233s
fix no deferred: 79.978956s 19.77%

base deferred: 77.254713s
fix deferred: 55.050509s 40.34%
==========================================================================
SPARC M6 3600 MHz with 15T of memory
TIME SPEED UP
base no deferred: 358.335727s
fix no deferred: 302.320936s 18.52%

base deferred: 237.534603s
fix deferred: 182.103003s 30.44%
==========================================================================
Raw dmesg output with timestamps:
x86 base no deferred: https://hastebin.com/ofunepurit.scala
x86 base deferred: https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred: https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred: https://hastebin.com/muhegoheyi.go
sparc fix deferred: https://hastebin.com/xadinobutu.go

This patch (of 11):

deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this
function, and also fixes a couple issues (described below).

The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.

=====
In deferred_init_memmap() where all deferred struct pages are
initialized we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed,
and the only way flags are not zero if it went through
__init_single_page() before. But, once we change the current behavior
and won't zero the memory in memblock allocator, we cannot trust
anything inside "struct page"es until they are initialized. This patch
fixes this.

The deferred_init_memmap() is re-written to loop through only free
memory ranges provided by memblock.

Note, this first issue is relevant only when the following change is
merged:

=====
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn)
goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn'
advances. Thus means if deferred struct pages are enabled on systems
with these kind of holes, linux would get memory corruptions. I have
fixed this issue by defining a new macro that performs all the necessary
operations when we free the current set of pages.

[pasha.tatashin@oracle.com: buddy page accessed before initialized]
Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49502766 15-Nov-2017 Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>

kmemcheck: remove annotations

Patch series "kmemcheck: kill kmemcheck", v2.

As discussed at LSF/MM, kill kmemcheck.

KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow). KASan is already upstream.

We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).

The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.

Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.

This patch (of 4):

Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7ab3672 15-Nov-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: fail has_unmovable_pages when seeing reserved pages

Reserved pages should be completely ignored by the core mm because they
have a special meaning for their owners. has_unmovable_pages doesn't
check those so we rely on other tests (reference count, or PageLRU) to
fail on such pages. Althought this happens to work it is safer to
simply check for those explicitly and do not rely on the owner of the
page to abuse those fields for special purposes.

Please note that this is more of a further fortification of the code
rahter than a fix of an existing issue.

Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4da2ce25 15-Nov-2017 Michal Hocko <mhocko@suse.com>

mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages()

Joonsoo has noticed that "mm: drop migrate type checks from
has_unmovable_pages" would break CMA allocator because it relies on
has_unmovable_pages returning false even for CMA pageblocks which in
fact don't have to be movable:

alloc_contig_range
start_isolate_page_range
set_migratetype_isolate
has_unmovable_pages

This is a result of the code sharing between CMA and memory hotplug
while each one has a different idea of what has_unmovable_pages should
return. This is unfortunate but fixing it properly would require a lot
of code duplication.

Fix the issue by introducing the requested migrate type argument and
special case MIGRATE_CMA case where CMA page blocks are handled
properly. This will work for memory hotplug because it requires
MIGRATE_MOVABLE.

Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ran Wang <ran.wang_1@nxp.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7b236e1 15-Nov-2017 Michal Hocko <mhocko@suse.com>

mm: drop migrate type checks from has_unmovable_pages

Michael has noticed that the memory offline tries to migrate kernel code
pages when doing

echo 0 > /sys/devices/system/memory/memory0/online

The current implementation will fail the operation after several failed
page migration attempts but we shouldn't even attempt to migrate that
memory and fail right away because this memory is clearly not
migrateable. This will become a real problem when we drop the retry
loop counter resp. timeout.

The real problem is in has_unmovable_pages in fact. We should fail if
there are any non migrateable pages in the area. In orther to guarantee
that remove the migrate type checks because MIGRATE_MOVABLE is not
guaranteed to contain only migrateable pages. It is merely a heuristic.
Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
allocate any non-migrateable pages from the block but CMA allocations
themselves are unlikely to migrateable. Therefore remove both checks.

[akpm@linux-foundation.org: remove unused local `mt']
Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ran Wang <ran.wang_1@nxp.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 629a359b 07-Nov-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/sparsemem: Fix ARM64 boot crash when CONFIG_SPARSEMEM_EXTREME=y

Since commit:

83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")

we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(),
but some architectures, like arm64, don't call the routine to initialize sparsemem.

Let's move the initialization into memory_present() it should cover all
architectures.

Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 83e3c487 29-Sep-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y

Size of the mem_section[] array depends on the size of the physical address space.

In preparation for boot-time switching between paging modes on x86-64
we need to make the allocation of mem_section[] dynamic, because otherwise
we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB
for 4-level paging and 2MB for 5-level paging mode.

The patch allocates the array on the first call to sparse_memory_present_with_active_regions().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9b6e63cb 03-Oct-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: add scheduling point to memmap_init_zone

memmap_init_zone gets a pfn range to initialize and it can be really
large resulting in a soft lockup on non-preemptible kernels

NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
[...]
task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
RIP: move_pfn_range_to_zone+0x185/0x1d0
[...]
Call Trace:
devm_memremap_pages+0x2c7/0x430
pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
nvdimm_bus_probe+0x64/0x110 [libnvdimm]
driver_probe_device+0x1f7/0x420
bus_for_each_drv+0x52/0x80
__device_attach+0xb0/0x130
bus_probe_device+0x87/0xa0
device_add+0x3fc/0x5f0
nd_async_device_register+0xe/0x40 [libnvdimm]
async_run_entry_fn+0x43/0x150
process_one_work+0x14e/0x410
worker_thread+0x116/0x490
kthread+0xc7/0xe0
ret_from_fork+0x3f/0x70

Fix this by adding a scheduling point once per page block.

Link: http://lkml.kernel.org/r/20170918121410.24466-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57148a64 03-Oct-2017 Arnd Bergmann <arnd@arndb.de>

mm: meminit: mark init_reserved_page as __meminit

The function is called from __meminit context and calls other __meminit
functions but isn't it self mark as such today:

WARNING: vmlinux.o(.text.unlikely+0x4516): Section mismatch in reference from the function init_reserved_page() to the function .meminit.text:early_pfn_to_nid()
The function init_reserved_page() references the function __meminit early_pfn_to_nid().
This is often because init_reserved_page lacks a __meminit annotation or the annotation of early_pfn_to_nid is wrong.

On most compilers, we don't notice this because the function gets
inlined all the time. Adding __meminit here fixes the harmless warning
for the old versions and is generally the correct annotation.

Link: http://lkml.kernel.org/r/20170915193149.901180-1-arnd@arndb.de
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f19360f0 08-Sep-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt

We are by error initializing alloc_flags before gfp_allowed_mask is
applied. This could cause problems after pm_restrict_gfp_mask() is called
during suspend operation. Apply gfp_allowed_mask before initializing
alloc_flags so that the first allocation attempt uses correct flags.

Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a321d2a 08-Sep-2017 Kemi Wang <kemi.wang@intel.com>

mm: change the call sites of numa statistics items

Patch series "Separate NUMA statistics from zone statistics", v2.

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).

A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang). The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.

With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark. Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).

I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.

Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Threshold CPU cycles Throughput(88 threads)
32 799 241760478
64 640 301628829
125 537 358906028 <==> system by default
256 468 412397590
512 428 450550704
4096 399 482520943
20000 394 489009617
30000 395 488017817
65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset
N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

This patch (of 3):

In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.

E.g. cat /proc/zoneinfo
***Base*** ***With this patch***
nr_free_pages 3976 nr_free_pages 3976
nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
nr_zone_active_anon 0 nr_zone_active_anon 0
nr_zone_inactive_file 0 nr_zone_inactive_file 0
nr_zone_active_file 0 nr_zone_active_file 0
nr_zone_unevictable 0 nr_zone_unevictable 0
nr_zone_write_pending 0 nr_zone_write_pending 0
nr_mlock 0 nr_mlock 0
nr_page_table_pages 0 nr_page_table_pages 0
nr_kernel_stack 0 nr_kernel_stack 0
nr_bounce 0 nr_bounce 0
nr_zspages 0 nr_zspages 0
numa_hit 0 *nr_free_cma 0*
numa_miss 0 numa_hit 0
numa_foreign 0 numa_miss 0
numa_interleave 0 numa_foreign 0
numa_local 0 numa_interleave 0
numa_other 0 numa_local 0
*nr_free_cma 0* numa_other 0
... ...
vm stats threshold: 10 vm stats threshold: 10
... ...

The next patch updates the numa stats counter size and threshold.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd04ae1e 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, oom: do not rely on TIF_MEMDIE for memory reserves access

For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves. There are few shortcomings of this implementation,
though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion. We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing. We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory. oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.

Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves. This makes the access to reserves independent
on which task has passed through mark_oom_victim. Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.

CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.

There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once. This change will allow such a usecase
without worrying about complete memory reserves depletion.

Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c41f012a 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm: rename global_page_state to global_zone_page_state

global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests. We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:

$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
2 NR_BOUNCE
2 NR_FREE_CMA_PAGES
11 NR_FREE_PAGES
1 NR_KERNEL_STACK_KB
1 NR_MLOCK
2 NR_PAGETABLE

This patch shouldn't introduce any functional change.

[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b93e0f32 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: get rid of zonelists_mutex

zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed though
because both memory online and offline are fully serialized. New users
have grown since then.

Notably setup_per_zone_wmarks wants to prevent from races between memory
hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes"). Let's
add a private lock for that purpose. This will not prevent from seeing
halfway through memory hotplug operation but that shouldn't be a big
deal becuse memory hotplug will update watermarks explicitly so we will
eventually get a full picture. The lock just makes sure we won't race
when updating watermarks leading to weird results.

Also __build_all_zonelists manipulates global data so add a private lock
for it as well. This doesn't seem to be necessary today but it is more
robust to have a lock there.

While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.

Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Haicheng Li <haicheng.li@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11cd8638 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: remove stop_machine from build_all_zonelists

build_all_zonelists has been (ab)using stop_machine to make sure that
zonelists do not change while somebody is looking at them. This is is
just a gross hack because a) it complicates the context from which we
can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
switch locking to a percpu rwsem")) and b) is is not really necessary
especially after "mm, page_alloc: simplify zonelist initialization" and
c) it doesn't really provide the protection it claims (see below).

Updates of the zonelists happen very seldom, basically only when a zone
becomes populated during memory online or when it loses all the memory
during offline. A racing iteration over zonelists could either miss a
zone or try to work on one zone twice. Both of these are something we
can live with occasionally because there will always be at least one
zone visible so we are not likely to fail allocation too easily for
example.

Please note that the original stop_machine approach doesn't really
provide a better exclusion because the iteration might be interrupted
half way (unless the whole iteration is preempt disabled which is not
the case in most cases) so the some zones could still be seen twice or a
zone missed.

I have run the pathological online/offline of the single memblock in the
movable zone while stressing the same small node with some memory
pressure.

Node 1, zone DMA
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 943, 943, 943)
Node 1, zone DMA32
pages free 227310
min 8294
low 10367
high 12440
spanned 262112
present 262112
managed 241436
protection: (0, 0, 0, 0)
Node 1, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 1024)
Node 1, zone Movable
pages free 32722
min 85
low 117
high 149
spanned 32768
present 32768
managed 32768
protection: (0, 0, 0, 0)

root@test1:/sys/devices/system/node/node1# while true
do
echo offline > memory34/state
echo online_movable > memory34/state
done

root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

and it survived without any unexpected behavior. While this is not
really a great testing coverage it should exercise the allocation path
quite a lot.

Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d3be21b 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: simplify zonelist initialization

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have
in each node we rely on the _zoneref to terminate initialized part of
the zonelist by a NULL zone. While this is functionally correct it is
quite suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.

While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns the
updated index for the next iteration. Let's rename the function to
build_zonerefs_node to better reflect its purpose and give it zoneref
array to update. The function doesn't the index anymore. It just
returns the number of added zones so that the caller can advance the
zonered array start for the next update.

This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.

Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72675e13 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: drop zone from build_all_zonelists

build_all_zonelists gets a zone parameter to initialize zone's pagesets.
There is only a single user which gives a non-NULL zone parameter and
that one doesn't really need the rest of the build_all_zonelists (see
commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
onlining pages")).

Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly. This will also remove a pointless zonlists
rebuilding which is always good.

Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9c9a0b9 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization

__build_all_zonelists reinitializes each online cpu local node for
CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
memory less nodes could gain some memory during memory hotplug and so
the local node should be changed for CPUs close to such a node. It
makes less sense to do that unconditionally for a newly creaded NUMA
node which is still offline and without any memory.

Let's also simplify the cpu loop and use for_each_online_cpu instead of
an explicit cpu_online check for all possible cpus.

Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# afb6ebb3 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: remove boot pageset initialization from memory hotplug

boot_pageset is a boot time hack which gets superseded by normal
pagesets later in the boot process. It makes zero sense to reinitialize
it again and again during memory hotplug.

Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9bff3ee 06-Sep-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: rip out ZONELIST_ORDER_ZONE

Patch series "cleanup zonelists initialization", v1.

This is aimed at cleaning up the zonelists initialization code we have
but the primary motivation was bug report [2] which got resolved but the
usage of stop_machine is just too ugly to live. Most patches are
straightforward but 3 of them need a special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept sysctl in place and warn into the log if somebody tries to
configure zone lists ordering. If somebody has a real usecase for it we
can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and updater which I _believe_
is acceptable as explained in the changelog. I hope I am not missing
anything.

Patch 8 then removes zonelists_mutex which is kind of ugly as well and
not really needed AFAICS but a care should be taken when double checking
my thinking.

This patch (of 9):

Supporting zone ordered zonelists costs us just a lot of code while the
usefulness is arguable if existent at all. Mel has already made node
ordering default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fallback to a
different NUMA node rather than consume precious lowmem zones.

This argument is, however, weaken by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means that
lowmem requests have to skip over all highmem pages on LRUs already and
so zone ordering doesn't save the reclaim time much. So the only
advantage of the zone ordering is under a light memory pressure when
highmem requests do not ever hit into lowmem zones and the lowmem
pressure doesn't need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and it
is generally advisable to use 64b kernel on such a HW I believe we
should rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if
somebody tries to set zone ordering either from kernel command line or
the sysctl.

[mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1152583 06-Sep-2017 Wei Yang <richard.weiyang@gmail.com>

mm/memory_hotplug: just build zonelist for newly added node

Commit 9adb62a5df9c ("mm/hotplug: correctly setup fallback zonelists
when creating new pgdat") tries to build the correct zonelist for a
newly added node, while it is not necessary to rebuild it for already
exist nodes.

In build_zonelists(), it will iterate on nodes with memory. For a newly
added node, it will have memory until node_states_set_node() is called
in online_pages().

This patch avoids rebuilding the zonelists for already existing nodes.

build_zonelists_node() uses managed_zone(zone) checks, so it should not
include empty zones anyway. So effectively we avoid some pointless work
under stop_machine().

[akpm@linux-foundation.org: tweak comment text]
[akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e746bf73 31-Aug-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm,page_alloc: don't call __node_reclaim() with oom_lock held.

We are doing a last second memory allocation attempt before calling
out_of_memory(). But since slab shrinker functions might indirectly
wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
allocations via sleeping locks, calling slab shrinker functions from
node_reclaim() from get_page_from_freelist() with oom_lock held has
possibility of deadlock. Therefore, make sure that last second memory
allocation attempt does not call slab shrinker functions.

Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 556b969a 25-Aug-2017 Chen Yu <yu.c.chen@intel.com>

PM/hibernate: touch NMI watchdog when creating snapshot

There is a problem that when counting the pages for creating the
hibernation snapshot will take significant amount of time, especially on
system with large memory. Since the counting job is performed with irq
disabled, this might lead to NMI lockup. The following warning were
found on a system with 1.5TB DRAM:

Freezing user space processes ... (elapsed 0.002 seconds) done.
OOM killer disabled.
PM: Preallocating image memory...
NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
RIP: 0010:memory_bm_find_bit+0xf4/0x100
Call Trace:
swsusp_set_page_free+0x2b/0x30
mark_free_pages+0x147/0x1c0
count_data_pages+0x41/0xa0
hibernate_preallocate_memory+0x80/0x450
hibernation_snapshot+0x58/0x410
hibernate+0x17c/0x310
state_store+0xdf/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x37/0x40
kernfs_fop_write+0x11c/0x1a0
__vfs_write+0x37/0x170
vfs_write+0xb1/0x1a0
SyS_write+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa5
...
done (allocated 6590003 pages)
PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)

It has taken nearly 20 seconds(2.10GHz CPU) thus the NMI lockup was
triggered. In case the timeout of the NMI watch dog has been set to 1
second, a safe interval should be 6590003/20 = 320k pages in theory.
However there might also be some platforms running at a lower frequency,
so feed the watchdog every 100k pages.

[yu.c.chen@intel.com: simplification]
Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
[yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3010f876 18-Aug-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: discard memblock data later

There is existing use after free bug when deferred struct pages are
enabled:

The memblock_add() allocates memory for the memory array if more than
128 entries are needed. See comment in e820__memblock_setup():

* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.

This memblock memory is freed here:
free_low_memory_core_early()

We access the freed memblock.memory later in boot when deferred pages
are initialized in this path:

deferred_init_memmap()
for_each_mem_pfn_range()
__next_mem_pfn_range()
type = &memblock.memory;

One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been
exceeded at least on systems where deferred struct pages were enabled.

Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting excuted and that the
freed pages are sane.

Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75dddef3 10-Aug-2017 Jonathan Toppins <jtoppins@redhat.com>

mm: ratelimit PFNs busy info message

The RDMA subsystem can generate several thousand of these messages per
second eventually leading to a kernel crash. Ratelimit these messages
to prevent this crash.

Doug said:
"I've been carrying a version of this for several kernel versions. I
don't remember when they started, but we have one (and only one) class
of machines: Dell PE R730xd, that generate these errors. When it
happens, without a rate limit, we get rcu timeouts and kernel oopses.
With the rate limit, we just get a lot of annoying kernel messages but
the machine continues on, recovers, and eventually the memory
operations all succeed"

And:
"> Well... why are all these EBUSY's occurring? It sounds inefficient
> (at least) but if it is expected, normal and unavoidable then
> perhaps we should just remove that message altogether?

I don't have an answer to that question. To be honest, I haven't
looked real hard. We never had this at all, then it started out of the
blue, but only on our Dell 730xd machines (and it hits all of them),
but no other classes or brands of machines. And we have our 730xd
machines loaded up with different brands and models of cards (for
instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
meant it wasn't tied to any particular brand/model of RDMA hardware.
To me, it always smelled of a hardware oddity specific to maybe the
CPUs or mainboard chipsets in these machines, so given that I'm not an
mm expert anyway, I never chased it down.

A few other relevant details: it showed up somewhere around 4.8/4.9 or
thereabouts. It never happened before, but the prinkt has been there
since the 3.18 days, so possibly the test to trigger this message was
changed, or something else in the allocator changed such that the
situation started happening on these machines?

And, like I said, it is specific to our 730xd machines (but they are
all identical, so that could mean it's something like their specific
ram configuration is causing the allocator to hit this on these
machine but not on other machines in the cluster, I don't want to say
it's necessarily the model of chipset or CPU, there are other bits of
identicalness between these machines)"

Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d507e2eb 10-Aug-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: fix global NR_SLAB_.*CLAIMABLE counter reads

As Tetsuo points out:
"Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
0kB"

In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.

This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.

Use global_node_page_state() to access the global slab counters.

Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d92a8cfc 03-Mar-2017 Peter Zijlstra <peterz@infradead.org>

locking/lockdep: Rework FS_RECLAIM annotation

A while ago someone, and I cannot find the email just now, asked if we
could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
like we use for other things like workqueues etc. I think this should
be possible which allows reducing the 'irq' states and will reduce the
amount of __bfs() lookups we do.

Removing the 1 IRQ state results in 4 less __bfs() walks per
dependency, improving lockdep performance. And by moving this
annotation out of the lockdep code it becomes easier for the mm people
to extend.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: iamjoonsoo.kim@lge.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 167d0f25 02-Aug-2017 Heiko Carstens <hca@linux.ibm.com>

mm: take memory hotplug lock within numa_zonelist_order_handler()

Andre Wild reported the following warning:

WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
Modules linked in:
CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
task: 00000000701d8100 task.stack: 0000000073594000
Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
...
Call Trace:
lockdep_assert_cpus_held+0x42/0x60)
stop_machine_cpuslocked+0x62/0xf0
build_all_zonelists+0x92/0x150
numa_zonelist_order_handler+0x102/0x150
proc_sys_call_handler.isra.12+0xda/0x118
proc_sys_write+0x34/0x48
__vfs_write+0x3c/0x178
vfs_write+0xbc/0x1a0
SyS_write+0x66/0xc0
system_call+0xc4/0x2b0
locks held by bash/1205:
#0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
#1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
#2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
Last Breaking-Event-Address:
lockdep_assert_cpus_held+0x48/0x60

This can be easily triggered with e.g.

echo n > /proc/sys/vm/numa_zonelist_order

In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.

This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.

This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().

In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().

Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dcda9b04 12-Jul-2017 Michal Hocko <mhocko@suse.com>

mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always
ignored for smaller sizes. This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)

- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
attempt to free memory at all. The most light weight mode which even
doesn't kick the background reclaim. Should be used carefully because
it might deplete the memory and the next user might hit the more
aggressive reclaim

- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
allocation without any attempt to free memory from the current
context but can wake kswapd to reclaim memory if the zone is below
the low watermark. Can be used from either atomic contexts or when
the request is a performance optimization and there is another
fallback for a slow path.

- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
non sleeping allocation with an expensive fallback so it can access
some portion of memory reserves. Usually used from interrupt/bh
context with an expensive slow path fallback.

- GFP_KERNEL - both background and direct reclaim are allowed and the
_default_ page allocator behavior is used. That means that !costly
allocation requests are basically nofail but there is no guarantee of
that behavior so failures have to be checked properly by callers
(e.g. OOM killer victim is allowed to fail currently).

- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
and all allocation requests fail early rather than cause disruptive
reclaim (one round of reclaim in this implementation). The OOM killer
is not invoked.

- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
behavior and all allocation requests try really hard. The request
will fail if the reclaim cannot make any progress. The OOM killer
won't be triggered.

- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
and all allocation requests will loop endlessly until they succeed.
This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f906ba2 10-Jul-2017 Thomas Gleixner <tglx@linutronix.de>

mm/memory-hotplug: switch locking to a percpu rwsem

Andrey reported a potential deadlock with the memory hotplug lock and
the cpu hotplug lock.

The reason is that memory hotplug takes the memory hotplug lock and then
calls stop_machine() which calls get_online_cpus(). That's the reverse
lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c

The problem has been there forever. The reason why this was never
reported is that the cpu hotplug locking had this homebrewn recursive
reader writer semaphore construct which due to the recursion evaded the
full lock dep coverage. The memory hotplug code copied that construct
verbatim and therefor has similar issues.

Three steps to fix this:

1) Convert the memory hotplug locking to a per cpu rwsem so the
potential issues get reported proper by lockdep.

2) Lock the online cpus in mem_hotplug_begin() before taking the memory
hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
code to avoid recursive locking.

3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu
hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
by invoking lru_add_drain_all_cpuslocked() instead.

Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b002529d 10-Jul-2017 Rasmus Villemoes <linux@rasmusvillemoes.dk>

mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback

Since current_order starts as MAX_ORDER-1 and is then only decremented,
the second half of the loop condition seems superfluous. However, if
order is 0, we may decrement current_order past 0, making it UINT_MAX.
This is obviously too subtle ([1], [2]).

Since we need to add some comment anyway, change the two variables to
signed, making the counting-down for loop look more familiar, and
apparently also making gcc generate slightly smaller code.

[1] https://lkml.org/lkml/2016/6/20/493
[2] https://lkml.org/lkml/2017/6/19/345

[akpm@linux-foundation.org: fix up reject fixupping]
Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reported-by: Hao Lee <haolee.swjtu@gmail.com>
Acked-by: Wei Yang <weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a8f58f3 10-Jul-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fallback to smallest page when not stealing whole pageblock

Since commit 3bc48f96cf11 ("mm, page_alloc: split smallest stolen page
in fallback") we pick the smallest (but sufficient) page of all that
have been stolen from a pageblock of different migratetype. However,
there are cases when we decide not to steal the whole pageblock.

Practically in the current implementation it means that we are trying to
fallback for a MIGRATE_MOVABLE allocation of order X, go through the
freelists from MAX_ORDER-1 down to X, and find free page of order Y. If
Y is less than pageblock_order / 2, we decide not to steal all pages
from the pageblock. When Y > X, it means we are potentially splitting a
larger page than we need, as there might be other pages of order Z,
where X <= Z < Y. Since Y is already too small to steal whole
pageblock, picking smallest available Z will result in the same decision
and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
MIGRATE_RECLAIMABLE pageblock.

This patch therefore changes the fallback algorithm so that in the
situation described above, we switch the fallback search strategy to go
from order X upwards to find the smallest suitable fallback. In theory
there shouldn't be a downside of this change wrt fragmentation.

This has been tested with mmtests' stress-highalloc performing
GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
statistics:

4.12.0-rc2 4.12.0-rc2
1-kernel4 2-kernel4
Page alloc extfrag event 25640976 69680977
Extfrag fragmenting 25621086 69661364
Extfrag fragmenting for unmovable 74409 73204
Extfrag fragmenting unmovable placed with movable 69003 67684
Extfrag fragmenting unmovable placed with reclaim. 5406 5520
Extfrag fragmenting for reclaimable 6398 8467
Extfrag fragmenting reclaimable placed with movable 869 884
Extfrag fragmenting reclaimable placed with unmov. 5529 7583
Extfrag fragmenting for movable 25540279 69579693

Since we force movable allocations to steal the smallest available page
(which we then practially always split), we steal less per fallback, so
the number of fallbacks increases and steals potentially happen from
different pageblocks. This is however not an issue for movable pages
that can be compacted.

Importantly, the "unmovable placed with movable" statistics is lower,
which is the result of less fragmentation in the unmovable pageblocks.
The effect on reclaimable allocation is a bit unclear.

Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f70029bb 06-Jul-2017 Michal Hocko <mhocko@suse.com>

mm, memory_hotplug: drop CONFIG_MOVABLE_NODE

Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
good explanation on why it is actually useful.

It makes a lot of sense to make movable node semantic opt in but we
already have that because the feature has to be explicitly enabled on
the kernel command line. A config option on top only makes the
configuration space larger without a good reason. It also adds an
additional ifdefery that pollutes the code.

Just drop the config option and make it de-facto always enabled. This
shouldn't introduce any change to the semantic.

Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kani Toshimitsu <toshi.kani@hpe.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 385386cf 06-Jul-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: vmstat: move slab statistics from zone to node counters

Patch series "mm: per-lruvec slab stats"

Josef is working on a new approach to balancing slab caches and the page
cache. For this to work, he needs slab cache statistics on the lruvec
level. These patches implement that by adding infrastructure that
allows updating and reading generic VM stat items per lruvec, then
switches some existing VM accounting sites, including the slab
accounting ones, to this new cgroup-aware API.

I'll follow up with more patches on this, because there is actually
substantial simplification that can be done to the memory controller
when we replace private memcg accounting with making the existing VM
accounting sites cgroup-aware. But this is enough for Josef to base his
slab reclaim work on, so here goes.

This patch (of 5):

To re-implement slab cache vs. page cache balancing, we'll need the
slab counters at the lruvec level, which, ever since lru reclaim was
moved from the zone to the node, is the intersection of the node, not
the zone, and the memcg.

We could retain the per-zone counters for when the page allocator dumps
its memory information on failures, and have counters on both levels -
which on all but NUMA node 0 is usually redundant. But let's keep it
simple for now and just move them. If anybody complains we can restore
the per-zone counters.

[hannes@cmpxchg.org: fix oops]
Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04ec6264 06-Jul-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: pass preferred nid instead of zonelist to allocator

The main allocator function __alloc_pages_nodemask() takes a zonelist
pointer as one of its parameters. All of its callers directly or
indirectly obtain the zonelist via node_zonelist() using a preferred
node id and gfp_mask. We can make the code a bit simpler by doing the
zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
id instead (gfp_mask is already another parameter).

There are some code size benefits thanks to removal of inlined
node_zonelist():

bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

This will also make things simpler if we proceed with converting cpusets
to zonelists.

Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 902b6281 06-Jul-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fix more premature OOM due to race with cpuset update

I would like to stress that this patchset aims to fix issues and cleanup
the code *within the existing documented semantics*, i.e. patch 1
ignores mempolicy restrictions if the set of allowed nodes has no
intersection with set of nodes allowed by cpuset. I believe discussing
potential changes of the semantics can be better done once we have a
baseline with no known bugs of the current semantics.

I've recently summarized the cpuset/mempolicy issues in a LSF/MM
proposal [1] and the discussion itself [2]. I've been trying to rewrite
the handling as proposed, with the idea that changing semantics to make
all mempolicies static wrt cpuset updates (and discarding the relative
and default modes) can be tried on top, as there's a high risk of being
rejected/reverted because somebody might still care about the removed
modes.

However I haven't yet figured out how to properly:

1) make mempolicies swappable instead of rebinding in place. I thought
mbind() already works that way and uses refcounting to avoid
use-after-free of the old policy by a parallel allocation, but turns
out true refcounting is only done for shared (shmem) mempolicies, and
the actual protection for mbind() comes from mmap_sem. Extending the
refcounting means more overhead in allocator hot path. Also swapping
whole mempolicies means that we have to allocate the new ones, which
can fail, and reverting of the partially done work also means
allocating (note that mbind() doesn't care and will just leave part
of the range updated and part not updated when returning -ENOMEM...).

2) make cpuset's task->mems_allowed also swappable (after converting it
from nodemask to zonelist, which is the easy part) for mostly the
same reasons.

The good news is that while trying to do the above, I've at least
figured out how to hopefully close the remaining premature OOM's, and do
a buch of cleanups on top, removing quite some of the code that was also
supposed to prevent the cpuset update races, but doesn't work anymore
nowadays. This should fix the most pressing concerns with this topic
and give us a better baseline before either proceeding with the original
proposal, or pushing a change of semantics that removes the problem 1)
above. I'd be then fine with trying to change the semantic first and
rewrite later.

Patchset has been tested with the LTP cpuset01 stress test.

[1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
[2] https://lwn.net/Articles/717797/
[3] https://marc.info/?l=linux-mm&m=149191957922828&w=2

This patch (of 6):

Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with
cpuset mems update") has fixed known recent regressions found by LTP's
cpuset01 testcase. I have however found that by modifying the testcase
to use per-vma mempolicies via bind(2) instead of per-task mempolicies
via set_mempolicy(2), the premature OOM still happens and the issue is
much older.

The root of the problem is that the cpuset's mems_allowed and
mempolicy's nodemask can temporarily have no intersection, thus
get_page_from_freelist() cannot find any usable zone. The current
semantic for empty intersection is to ignore mempolicy's nodemask and
honour cpuset restrictions. This is checked in node_zonelist(), but the
racy update can happen after we already passed the check. Such races
should be protected by the seqlock task->mems_allowed_seq, but it
doesn't work here, because 1) mpol_rebind_mm() does not happen under
seqlock for write, and doing so would lead to deadlock, as it takes
mmap_sem for write, while the allocation can have mmap_sem for read when
it's taking the seqlock for read. And 2) the seqlock cookie of callers
of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is
different than the one of __alloc_pages_slowpath(), so there's still a
potential race window.

This patch fixes the issue by having __alloc_pages_slowpath() check for
empty intersection of cpuset and ac->nodemask before OOM or allocation
failure. If it's indeed empty, the nodemask is ignored and allocation
retried, which mimics node_zonelist(). This works fine, because almost
all callers of __alloc_pages_nodemask are obtaining the nodemask via
node_zonelist(). The only exception is new_node_page() from hotplug,
where the potential violation of nodemask isn't an issue, as there's
already a fallback allocation attempt without any nodemask. If there's
a future caller that needs to have its specific nodemask honoured over
task's cpuset restrictions, we'll have to e.g. add a gfp flag for that.

Link: http://lkml.kernel.org/r/20170517081140.30654-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d73d3c9f 06-Jul-2017 Matthias Kaehlcke <mka@chromium.org>

mm/page_alloc.c: mark bad_range() and meminit_pfn_in_nid() as __maybe_unused

The functions are not used in some configurations. Adding the attribute
fixes the following warnings when building with clang:

mm/page_alloc.c:409:19: error: function 'bad_range' is not needed and
will not be emitted [-Werror,-Wunneeded-internal-declaration]

mm/page_alloc.c:1106:30: error: unused function 'meminit_pfn_in_nid'
[-Werror,-Wunused-function]

Link: http://lkml.kernel.org/r/20170518182030.165633-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9017217b 06-Jul-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: adaptive hash table scaling

Allow hash tables to scale with memory but at slower pace, when
HASH_ADAPT is provided every time memory quadruples the sizes of hash
tables will only double instead of quadrupling as well. This algorithm
starts working only when memory size reaches a certain point, currently
set to 64G.

This is example of dentry hash table size, before and after four various
memory configurations:

MEMORY SCALE HASH_SIZE
old new old new
8G 13 13 8M 8M
16G 13 13 16M 16M
32G 13 13 32M 32M
64G 13 13 64M 64M
128G 13 14 128M 64M
256G 13 14 256M 128M
512G 13 15 512M 128M
1024G 13 15 1024M 256M
2048G 13 16 2048M 256M
4096G 13 16 4096M 512M
8192G 13 17 8192M 512M
16384G 13 17 16384M 1024M
32768G 13 18 32768M 1024M
65536G 13 18 65536M 2048M

The effect of this change on runtime is undetectable as filesystem
growth is not proportional to machine memory size as is currently
assumed. The change effects only large memory machine. Additional
tuning might be needed, but that can be done by the clients of the
kmem_cache_create interface, not the generic cache allocator itself.

The adaptive hashing is disabled on 32 bit systems to avoid confusion of
whether base should be different for smaller systems, and to avoid
overflows.

[mhocko@suse.com: drop HASH_ADAPT]
Link: http://lkml.kernel.org/r/20170509094607.GG6481@dhcp22.suse.cz
[pasha.tatashin@oracle.com: UL -> ULL fix]
Link: http://lkml.kernel.org/r/1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: disable adaptive hash on 32 bit systems]
Link: http://lkml.kernel.org/r/1495469329-755807-2-git-send-email-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/1488432825-92126-5-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3749a8f0 06-Jul-2017 Pavel Tatashin <pasha.tatashin@oracle.com>

mm: zero hash tables in allocator

Add a new flag HASH_ZERO which when provided grantees that the hash
table that is returned by alloc_large_system_hash() is zeroed. In most
cases that is what is needed by the caller. Use page level allocator's
__GFP_ZERO flags to zero the memory. It is using memset() which is
efficient method to zero memory and is optimized for most platforms.

Link: http://lkml.kernel.org/r/1488432825-92126-3-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d070eab 06-Jul-2017 Michal Hocko <mhocko@suse.com>

mm: consider zone which is not fully populated to have holes

__pageblock_pfn_to_page has two users currently, set_zone_contiguous
which checks whether the given zone contains holes and
pageblock_pfn_to_page which then carefully returns a first valid page
from the given pfn range for the given zone. This doesn't handle zones
which are not fully populated though. Memory pageblocks can be offlined
or might not have been onlined yet. In such a case the zone should be
considered to have holes otherwise pfn walkers can touch and play with
offline pages.

Current callers of pageblock_pfn_to_page in compaction seem to work
properly right now because they only isolate PageBuddy
(isolate_freepages_block) or PageLRU resp. __PageMovable
(isolate_migratepages_block) which will be always false for these pages.
It would be safer to skip these pages altogether, though.

In order to do this patch adds a new memory section state
(SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
in online_pages_range during the memory hotplug. Similarly
offline_mem_sections clears the bit and it is called when the memory
range is offlined.

pfn_to_online_page helper is then added which check the mem section and
only returns a page if it is onlined already.

Use the new helper in __pageblock_pfn_to_page and skip the whole page
block in such a case.

[mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
mark sections online after all struct pages are initialized in
online_pages_range (Vlastimil)]
Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc0bbf3b 06-Jul-2017 Michal Hocko <mhocko@suse.com>

mm: remove return value from init_currently_empty_zone

Patch series "mm: make movable onlining suck less", v4.

Movable onlining is a real hack with many downsides - mainly
reintroduction of lowmem/highmem issues we used to have on 32b systems -
but it is the only way to make the memory hotremove more reliable which
is something that people are asking for.

The current semantic of memory movable onlinening is really cumbersome,
however. The main reason for this is that the udev driven approach is
basically unusable because udev races with the memory probing while only
the last memory block or the one adjacent to the existing zone_movable
are allowed to be onlined movable. In short the criterion for the
successful online_movable changes under udev's feet. A reliable udev
approach would require a 2 phase approach where the first successful
movable online would have to check all the previous blocks and online
them in descending order. This is hard to be considered sane.

This patchset aims at making the onlining semantic more usable. First
of all it allows to online memory movable as long as it doesn't clash
with the existing ZONE_NORMAL. That means that ZONE_NORMAL and
ZONE_MOVABLE cannot overlap. Currently I preserve the original ordering
semantic so the zone always precedes the movable zone but I have plans
to remove this restriction in future because it is not really necessary.

First 3 patches are cleanups which should be ready to be merged right
away (unless I have missed something subtle of course).

Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.

Patch 5 deals with implicit assumptions of register_one_node on pgdat
initialization.

Patches 6-10 deal with offline holes in the zone for pfn walkers. I
hope I got all of them right but people familiar with compaction should
double check this.

Patch 11 is the core of the change. In order to make it easier to
review I have tried it to be as minimalistic as possible and the large
code removal is moved to patch 14.

Patch 12 is a trivial follow up cleanup. Patch 13 fixes sparse warnings
and finally patch 14 removes the unused code.

I have tested the patches in kvm:
# qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...

and then probed the additional memory by
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

Then I have used this simple script to probe the memory block by hand
# cat probe_memblock.sh
#!/bin/sh

BLOCK_NR=$1

# echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe

# for i in $(seq 10); do sh probe_memblock.sh $i; done
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable
/sys/devices/system/memory/memory35/valid_zones:Normal Movable
/sys/devices/system/memory/memory36/valid_zones:Normal Movable
/sys/devices/system/memory/memory37/valid_zones:Normal Movable
/sys/devices/system/memory/memory38/valid_zones:Normal Movable
/sys/devices/system/memory/memory39/valid_zones:Normal Movable

The main difference to the original implementation is that all new
memblocks can be both online_kernel and online_movable initially because
there is no clash obviously. For the comparison the original
implementation would have

/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal
/sys/devices/system/memory/memory35/valid_zones:Normal
/sys/devices/system/memory/memory36/valid_zones:Normal
/sys/devices/system/memory/memory37/valid_zones:Normal
/sys/devices/system/memory/memory38/valid_zones:Normal
/sys/devices/system/memory/memory39/valid_zones:Normal Movable

Now
# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
/sys/devices/system/memory/memory36/valid_zones:Movable
/sys/devices/system/memory/memory37/valid_zones:Movable
/sys/devices/system/memory/memory38/valid_zones:Movable
/sys/devices/system/memory/memory39/valid_zones:Movable

Block 33 can still be online both kernel and movable while all
the remaining can be only movable.

/proc/zonelist says
Node 0, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
--
Node 0, zone Movable
pages free 32753
min 85
low 117
high 149
spanned 32768
present 32768

A new memblock at a lower address will result in a new memblock (32)
which will still allow both Normal and Movable.

# sh probe_memblock.sh 0
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable

and online_kernel will convert it to the zone normal properly
while 33 can be still onlined both ways.

# echo online_kernel > /sys/devices/system/memory/memory32/state
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable

/proc/zoneinfo will now tell
Node 0, zone Normal
pages free 65441
min 165
low 230
high 295
spanned 65536
present 65536
--
Node 0, zone Movable
pages free 32740
min 82
low 114
high 146
spanned 32768
present 32768

so both zones have one memblock spanned and present.

Onlining 39 should associate this block to the movable zone

# echo online > /sys/devices/system/memory/memory39/state

/proc/zoneinfo will now tell
Node 0, zone Normal
pages free 32765
min 80
low 112
high 144
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 65501
min 160
low 225
high 290
spanned 196608
present 65536

so we will have a movable zone which spans 6 memblocks, 2 present and 4
representing a hole.

Offlining both movable blocks will lead to the zone with no present
pages which is the expected behavior I believe.

# echo offline > /sys/devices/system/memory/memory39/state
# echo offline > /sys/devices/system/memory/memory34/state
# grep -A6 "Movable\|Normal" /proc/zoneinfo
Node 0, zone Normal
pages free 32735
min 90
low 122
high 154
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 0
min 0
low 0
high 0
spanned 196608
present 0

As a bonus we will get a nice cleanup in the memory hotplug codebase.

This patch (of 16):

init_currently_empty_zone doesn't have any error to return yet it is
still an int and callers try to be defensive and try to handle potential
error. Remove this nonsense and simplify all callers.

This patch shouldn't have any visible effect

Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 864b9a39 02-Jun-2017 Michal Hocko <mhocko@suse.com>

mm: consider memblock reservations for deferred memory initialization sizing

We have seen an early OOM killer invocation on ppc64 systems with
crashkernel=4096M:

kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
kthreadd cpuset=/ mems_allowed=7
CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
Call Trace:
dump_stack+0xb0/0xf0 (unreliable)
dump_header+0xb0/0x258
out_of_memory+0x5f0/0x640
__alloc_pages_nodemask+0xa8c/0xc80
kmem_getpages+0x84/0x1a0
fallback_alloc+0x2a4/0x320
kmem_cache_alloc_node+0xc0/0x2e0
copy_process.isra.25+0x260/0x1b30
_do_fork+0x94/0x470
kernel_thread+0x48/0x60
kthreadd+0x264/0x330
ret_from_kernel_thread+0x5c/0xa4

Mem-Info:
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:5 slab_unreclaimable:73
mapped:0 shmem:0 pagetables:0 bounce:0
free:0 free_pcp:0 free_cma:0
Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
0 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
819200 pages RAM
0 pages HighMem/MovableOnly
817481 pages reserved
0 pages cma reserved
0 pages hwpoisoned

the reason is that the managed memory is too low (only 110MB) while the
rest of the the 50GB is still waiting for the deferred intialization to
be done. update_defer_init estimates the initial memoty to initialize
to 2GB at least but it doesn't consider any memory allocated in that
range. In this particular case we've had

Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)

so the low 2GB is mostly depleted.

Fix this by considering memblock allocations in the initial static
initialization estimation. Move the max_initialise to
reset_deferred_meminit and implement a simple memblock_reserved_memory
helper which iterates all reserved blocks and sums the size of all that
start below the given address. The cumulative size is than added on top
of the initial estimation. This is still not ideal because
reset_deferred_meminit doesn't consider holes and so reservation might
be above the initial estimation whihch we ignore but let's make the
logic simpler until we really need to handle more complicated cases.

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c288983d 02-Jun-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once

Roman Gushchin has reported that the OOM killer can trivially selects
next OOM victim when a thread doing memory allocation from page fault
path was selected as first OOM victim.

allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
oom_kill_process+0x219/0x3e0
out_of_memory+0x11d/0x480
__alloc_pages_slowpath+0xc84/0xd40
__alloc_pages_nodemask+0x245/0x260
alloc_pages_vma+0xa2/0x270
__handle_mm_fault+0xca9/0x10c0
handle_mm_fault+0xf3/0x210
__do_page_fault+0x240/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
__alloc_pages_slowpath+0xd32/0xd40
__alloc_pages_nodemask+0x245/0x260
alloc_pages_vma+0xa2/0x270
__handle_mm_fault+0xca9/0x10c0
handle_mm_fault+0xf3/0x210
__do_page_fault+0x240/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
...
allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
oom_kill_process+0x219/0x3e0
out_of_memory+0x11d/0x480
pagefault_out_of_memory+0x68/0x80
mm_fault_error+0x8f/0x190
? handle_mm_fault+0xf3/0x210
__do_page_fault+0x4b2/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

There is a race window that the OOM reaper completes reclaiming the
first victim's memory while nothing but mutex_trylock() prevents the
first victim from calling out_of_memory() from pagefault_out_of_memory()
after memory allocation for page fault path failed due to being selected
as an OOM victim.

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
GFP_NOFAIL checks in the allocator slowpath") because that commit
silently changed the behavior from

/* Avoid allocations with no watermarks from looping endlessly */

to

/*
* Give up allocations without trying memory reserves if selected
* as an OOM victim
*/

in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
flag. I have noticed this change but I didn't post a patch because I
thought it is an acceptable change other than noise by warn_alloc()
because !__GFP_NOFAIL allocations are allowed to fail. But we
overlooked that failing memory allocation from page fault path makes
difference due to the race window explained above.

While it might be possible to add a check to pagefault_out_of_memory()
that prevents the first victim from calling out_of_memory() or remove
out_of_memory() from pagefault_out_of_memory(), changing
pagefault_out_of_memory() does not suppress noise by warn_alloc() when
allocating thread was selected as an OOM victim. There is little point
with printing similar backtraces and memory information from both
out_of_memory() and warn_alloc().

Instead, if we guarantee that current thread can try allocations with no
watermarks once when current thread looping inside
__alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
can use memory reserves" rules and suppress noise by warn_alloc() and
prevent memory allocations from page fault path from calling
pagefault_out_of_memory().

If we take the comment literally, this patch would do

- if (test_thread_flag(TIF_MEMDIE))
- goto nopage;
+ if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
+ goto nopage;

because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
given. But if I recall correctly (I couldn't find the message), the
condition is meant to apply to only OOM victims despite the comment.
Therefore, this patch preserves TIF_MEMDIE check.

Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Roman Gushchin <guro@fb.com>
Tested-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org> [4.11]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 499118e9 08-May-2017 Vlastimil Babka <vbabka@suse.cz>

mm: introduce memalloc_noreclaim_{save,restore}

The previous patch ("mm: prevent potential recursive reclaim due to
clearing PF_MEMALLOC") has shown that simply setting and clearing
PF_MEMALLOC in current->flags can result in wrongly clearing a
pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
Let's introduce helpers that support proper nesting by saving the
previous stat of the flag, similar to the existing memalloc_noio_* and
memalloc_nofs_* helpers. Convert existing setting/clearing of
PF_MEMALLOC within mm to the new helpers.

There are no known issues with the converted code, but the change makes
it more robust.

Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62be1511 08-May-2017 Vlastimil Babka <vbabka@suse.cz>

mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

Patch series "more robust PF_MEMALLOC handling"

This series aims to unify the setting and clearing of PF_MEMALLOC, which
prevents recursive reclaim. There are some places that clear the flag
unconditionally from current->flags, which may result in clearing a
pre-existing flag. This already resulted in a bug report that Patch 1
fixes (without the new helpers, to make backporting easier). Patch 2
introduces the new helpers, modelled after existing memalloc_noio_* and
memalloc_nofs_* helpers, and converts mm core to use them. Patches 3
and 4 convert non-mm code.

This patch (of 4):

__alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
during page migration by lock_page() (see the comment in
__unmap_and_move()). Then it unconditionally clears the flag, which can
clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
This was not a problem until commit a8161d1ed609 ("mm, page_alloc:
restructure direct compaction handling in slowpath"), because direct
compation was called only after direct reclaim, which was skipped when
PF_MEMALLOC flag was set.

Even now it's only a theoretical issue, as the new callsite of
__alloc_pages_direct_compact() is reached only for costly orders and
when gfp_pfmemalloc_allowed() is true, which means either
__GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true. There is no
such known context, but let's play it safe and make
__alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
already set.

Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 282722b0 08-May-2017 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: restrict async compaction to pageblocks of same migratetype

The migrate scanner in async compaction is currently limited to
MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce
latency, based on the assumption that non-MOVABLE pageblocks are
unlikely to contain movable pages.

However, with the exception of THP's, most high-order allocations are
not movable. Should the async compaction succeed, this increases the
chance that the non-MOVABLE allocations will fallback to a MOVABLE
pageblock, making the long-term fragmentation worse.

This patch attempts to help the situation by changing async direct
compaction so that the migrate scanner only scans the pageblocks of the
requested migratetype. If it's a non-MOVABLE type and there are such
pageblocks that do contain movable pages, chances are that the
allocation can succeed within one of such pageblocks, removing the need
for a fallback. If that fails, the subsequent sync attempt will ignore
this restriction.

In testing based on 4.9 kernel with stress-highalloc from mmtests
configured for order-4 GFP_KERNEL allocations, this patch has reduced
the number of unmovable allocations falling back to movable pageblocks
by 30%. The number of movable allocations falling back is reduced by
12%.

Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 02aa0cdd 08-May-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: count movable pages when stealing from pageblock

When stealing pages from pageblock of a different migratetype, we count
how many free pages were stolen, and change the pageblock's migratetype
if more than half of the pageblock was free. This might be too
conservative, as there might be other pages that are not free, but were
allocated with the same migratetype as our allocation requested.

While we cannot determine the migratetype of allocated pages precisely
(at least without the page_owner functionality enabled), we can count
pages that compaction would try to isolate for migration - those are
either on LRU or __PageMovable(). The rest can be assumed to be
MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
distinguish. This counting can be done as part of free page stealing
with little additional overhead.

The page stealing code is changed so that it considers free pages plus
pages of the "good" migratetype for the decision whether to change
pageblock's migratetype.

The result should be more accurate migratetype of pageblocks wrt the
actual pages in the pageblocks, when stealing from semi-occupied
pageblocks. This should help the efficiency of page grouping by
mobility.

In testing based on 4.9 kernel with stress-highalloc from mmtests
configured for order-4 GFP_KERNEL allocations, this patch has reduced
the number of unmovable allocations falling back to movable pageblocks
by 47%. The number of movable allocations falling back to other
pageblocks are increased by 55%, but these events don't cause permanent
fragmentation, so the tradeoff should be positive. Later patches also
offset the movable fallback increase to some extent.

[akpm@linux-foundation.org: merge fix]
Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3bc48f96 08-May-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: split smallest stolen page in fallback

The __rmqueue_fallback() function is called when there's no free page of
requested migratetype, and we need to steal from a different one.

There are various heuristics to make this event infrequent and reduce
permanent fragmentation. The main one is to try stealing from a
pageblock that has the most free pages, and possibly steal them all at
once and convert the whole pageblock. Precise searching for such
pageblock would be expensive, so instead the heuristics walks the free
lists from MAX_ORDER down to requested order and assumes that the block
with highest-order free page is likely to also have the most free pages
in total.

Chances are that together with the highest-order page, we steal also
pages of lower orders from the same block. But then we still split the
highest order page. This is wasteful and can contribute to
fragmentation instead of avoiding it.

This patch thus changes __rmqueue_fallback() to just steal the page(s)
and put them on the freelist of the requested migratetype, and only
report whether it was successful. Then we pick (and eventually split)
the smallest page with __rmqueue_smallest(). This all happens under
zone lock, so nobody can steal it from us in the process. This should
reduce fragmentation due to fallbacks. At worst we are only stealing a
single highest-order page and waste some cycles by moving it between
lists and then removing it, but fallback is not exactly hot path so that
should not be a concern. As a side benefit the patch removes some
duplicate code by reusing __rmqueue_smallest().

[vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f7896f1 03-May-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc()

Commit c0a32fc5a2e4 ("mm: more intensive memory corruption debugging")
changed to check debug_guardpage_minorder() > 0 when reporting
allocation failures. The reasoning was

When we use guard page to debug memory corruption, it shrinks
available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
value. In such case memory allocation failures can be common and
printing errors can flood dmesg. If somebody debug corruption,
allocation failures are not the things he/she is interested about.

but this is misguided.

Allocation requests with __GFP_NOWARN flag by definition do not cause
flooding of allocation failure messages. Allocation requests with
__GFP_NORETRY flag likely also have __GFP_NOWARN flag. Costly
allocation requests likely also have __GFP_NOWARN flag.

Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
__GFP_NOWARN flag or __GFP_HIGH flag. Non-costly allocation requests
with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
small to fail" memory-allocation rule.

Therefore, as a whole, shrinking available pages by
debug_guardpage_minorder= kernel boot parameter might cause flooding of
OOM killer messages but unlikely causes flooding of allocation failure
messages. Let's remove debug_guardpage_minorder() > 0 check which would
likely be pointless.

Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd33ef36 03-May-2017 Vinayak Menon <vinmenon@codeaurora.org>

mm: enable page poisoning early at boot

On SPARSEMEM systems page poisoning is enabled after buddy is up,
because of the dependency on page extension init. This causes the pages
released by free_all_bootmem not to be poisoned. This either delays or
misses the identification of some issues because the pages have to
undergo another cycle of alloc-free-alloc for any corruption to be
detected.

Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
flag. Since all the free pages will now be poisoned, the flag need not
be verified before checking the poison during an alloc.

[vinmenon@codeaurora.org: fix Kconfig]
Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Laura Abbott <labbott@redhat.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82251963 03-May-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings

__GFP_NOWARN, which is usually added to avoid warnings from callsites
that expect to fail and have fallbacks, currently also suppresses
allocation stall warnings. These trigger when an allocation is stuck
inside the allocator for 10 seconds or longer.

But there is no class of allocations that can get legitimately stuck in
the allocator for this long. This always indicates a problem.

Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.

Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7dea19f9 03-May-2017 Michal Hocko <mhocko@suse.com>

mm: introduce memalloc_nofs_{save,restore} API

GFP_NOFS context is used for the following 5 reasons currently:

- to prevent from deadlocks when the lock held by the allocation
context would be needed during the memory reclaim

- to prevent from stack overflows during the reclaim because the
allocation is performed from a deep context already

- to prevent lockups when the allocation context depends on other
reclaimers to make a forward progress indirectly

- just in case because this would be safe from the fs POV

- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems to
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6ffdc07 03-May-2017 Xishi Qiu <qiuxishi@huawei.com>

mm: use is_migrate_highatomic() to simplify the code

Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

Simplify the code, no functional changes.

[akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 491d79ae 03-May-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: remove unnecessary back-off function when retrying page reclaim

The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
loops without progress, we'll OOM anyway; backing off might cut one or
two iterations off that in the rare OOM case. If we have intermittent
success reclaiming a few pages, the backoff function gets reset also,
and so is of little help in these scenarios.

We might want a backoff function for when there IS progress, but not
enough to be satisfactory. But this isn't that. Remove it.

Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c822f622 03-May-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()

NR_PAGES_SCANNED counts number of pages scanned since the last page free
event in the allocator. This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them. In that role, it has been replaced in the preceding
patches by a different mechanism.

Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well. It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.

Remove the counter and the unused pgdat_reclaimable().

Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c73322d0 03-May-2017 Johannes Weiner <hannes@cmpxchg.org>

mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d34b0733 20-Apr-2017 Mel Gorman <mgorman@techsingularity.net>

Revert "mm, page_alloc: only use per-cpu allocator for irq-safe requests"

This reverts commit 374ad05ab64.

While the patch worked great for userspace allocations, the fact that
softirq loses the per-cpu allocator caused problems. It needs to be
redone taking into account that a separate list is needed for hard/soft
IRQs or alternatively find a cheap way of detecting reentry due to an
interrupt. Both are possible but sufficiently tricky that it shouldn't
be rushed.

Jesper had one method for allowing softirqs but reported that the cost
was high enough that it performed similarly to a plain revert. His
figures for netperf TCP_STREAM were as follows

Baseline v4.10.0 : 60316 Mbit/s
Current 4.11.0-rc6: 47491 Mbit/s
Jesper's patch : 60662 Mbit/s
This patch : 60106 Mbit/s

As this is a regression, I wish to revert to noirq allocator for now and
go back to the drawing board.

Link: http://lkml.kernel.org/r/20170415145350.ixy7vtrzdzve57mh@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Tariq Toukan <ttoukan.linux@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce612879 07-Apr-2017 Michal Hocko <mhocko@suse.com>

mm: move pcp and lru-pcp draining into single wq

We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches. This seems more than necessary because both can run
on a single WQ. Both do not block on locks requiring a memory
allocation nor perform any allocations themselves. We will save one
rescuer thread this way.

On the other hand drain_all_pages() queues work on the system wq which
doesn't have rescuer and so this depend on memory allocation (when all
workers are stuck allocating and new ones cannot be created).

Initially we thought this would be more of a theoretical problem but
Hugh Dickins has reported:

: 4.11-rc has been giving me hangs after hours of swapping load. At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().

This worker should be using WQ_RECLAIM as well in order to guarantee a
forward progress. We can reuse the same one as for lru draining and
vmstat.

Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f06b81a 07-Apr-2017 Alexander Polakov <apolyakov@beget.ru>

mm/page_alloc.c: fix print order in show_free_areas()

Fixes: 11fb998986a72a ("mm: move most file-based accounting to the node")
Link: http://lkml.kernel.org/r/1490377730.30219.2.camel@beget.ru
Signed-off-by: Alexander Polyakov <apolyakov@beget.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b80f0f6c 02-Apr-2017 Steven Rostedt (VMware) <rostedt@goodmis.org>

ftrace: Have init/main.c call ftrace directly to free init memory

Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.

Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 0e056eb5 30-Mar-2017 Mauro Carvalho Chehab <mchehab@kernel.org>

kernel-api.rst: fix a series of errors when parsing C files

./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
./mm/filemap.c:1283: ERROR: Unexpected indentation.
./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
./ipc/util.c:676: ERROR: Unexpected indentation.
./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
./security/security.c:109: ERROR: Unexpected indentation.
./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
./ipc/util.c:477: ERROR: Unknown target name: "s".

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# 42c269c8 03-Mar-2017 Steven Rostedt (VMware) <rostedt@goodmis.org>

ftrace: Allow for function tracing to record init functions on boot up

Adding a hook into free_reserve_area() that informs ftrace that boot up init
text is being free, lets ftrace safely remove those init functions from its
records, which keeps ftrace from trying to modify text that no longer
exists.

Note, this still does not allow for tracing .init text of modules, as
modules require different work for freeing its init code.

Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com

Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Requested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# b4fb8f66 08-Mar-2017 Tony Luck <tony.luck@intel.com>

mm, page_alloc: Add missing check for memory holes

Commit 13ad59df67f1 ("mm, page_alloc: avoid page_to_pfn() when merging
buddies") moved the check for memory holes out of page_is_buddy() and
had the callers do the check.

But this wasn't done correctly in one place which caused ia64 to crash
very early in boot.

Update to fix that and make ia64 boot again.

[ v2: Vlastimil pointed out we don't need to call page_to_pfn()
since we already have the result of that in "buddy_pfn" ]

Fixes: 13ad59df67f1 ("avoid page_to_pfn() when merging buddies")
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b3cc15a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>

Update the .c files that depend on these APIs.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 89d790ab 27-Feb-2017 Masahiro Yamada <yamada.masahiro@socionext.com>

scripts/spelling.txt: add "algined" pattern and fix typo instances

Fix typos and add the following to the scripts/spelling.txt:

algined||aligned

While we are here, fix the "appplication" in the touched line in
drivers/block/loop.c. Also, fix the "may not naturally ..." to
"may not be naturally ..." in the touched line in mm/page_alloc.

Link: http://lkml.kernel.org/r/1481573103-11329-9-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad69444e 24-Feb-2017 Wei Yang <richard.weiyang@gmail.com>

mm/page_alloc.c: remove redundant init code for ZONE_MOVABLE

arch_zone_lowest/highest_possible_pfn[] is set to 0 and [ZONE_MOVABLE]
is skipped in the loop. No need to reset them to 0 again.

This patch just removes the redundant code.

Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e02dc017 24-Feb-2017 Gavin Shan <gwshan@linux.vnet.ibm.com>

mm/page_alloc: fix nodes for reclaim in fast path

When @node_reclaim_node isn't 0, the page allocator tries to reclaim
pages if the amount of free memory in the zones are below the low
watermark. On Power platform, none of NUMA nodes are scanned for page
reclaim because no nodes match the condition in zone_allows_reclaim().
On Power platform, RECLAIM_DISTANCE is set to 10 which is the distance
of Node-A to Node-A. So the preferred node even won't be scanned for
page reclaim.

__alloc_pages_nodemask()
get_page_from_freelist()
zone_allows_reclaim()

Anton proposed the test code as below:

# cat alloc.c
:
int main(int argc, char *argv[])
{
void *p;
unsigned long size;
unsigned long start, end;

start = time(NULL);
size = strtoul(argv[1], NULL, 0);
printf("To allocate %ldGB memory\n", size);

size <<= 30;
p = malloc(size);
assert(p);
memset(p, 0, size);

end = time(NULL);
printf("Used time: %ld seconds\n", end - start);
sleep(3600);
return 0;
}

The system I use for testing has two NUMA nodes. Both have 128GB
memory. In below scnario, the page caches on node#0 should be reclaimed
when it encounters pressure to accommodate request of allocation.

# echo 2 > /proc/sys/vm/zone_reclaim_mode; \
sync; \
echo 3 > /proc/sys/vm/drop_caches; \
# taskset -c 0 cat file.32G > /dev/null; \
grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages: 33619712 kB
# taskset -c 0 ./alloc 128
# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages: 33619840 kB
# grep MemFree /sys/devices/system/node/node0/meminfo
Node 0 MemFree: 186816 kB

With the patch applied, the pagecache on node-0 is reclaimed when its
free memory is running out. It's the expected behaviour.

# echo 2 > /proc/sys/vm/zone_reclaim_mode; \
sync; \
echo 3 > /proc/sys/vm/drop_caches
# taskset -c 0 cat file.32G > /dev/null; \
grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages: 33605568 kB
# taskset -c 0 ./alloc 128
# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages: 1379520 kB
# grep MemFree /sys/devices/system/node/node0/meminfo
Node 0 MemFree: 317120 kB

Fixes: 5f7a75acdb24 ("mm: page_alloc: do not cache reclaim distances")
Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2bf14d1 24-Feb-2017 Masanari Iida <standby24x7@gmail.com>

mm/page_alloc.c: remove duplicate inclusion of page_ext.h

Link: http://lkml.kernel.org/r/20170202011942.1609-1-standby24x7@gmail.com
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca96b625 24-Feb-2017 Lucas Stach <l.stach@pengutronix.de>

mm: alloc_contig_range: allow to specify GFP mask

Currently alloc_contig_range assumes that the compaction should be done
with the default GFP_KERNEL flags. This is probably right for all
current uses of this interface, but may change as CMA is used in more
use-cases (including being the default DMA memory allocator on some
platforms).

Change the function prototype, to allow for passing through the GFP mask
set by upper layers.

Also respect global restrictions by applying memalloc_noio_flags to the
passed in flags.

Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.de
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alexander Graf <agraf@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0efadf48 24-Feb-2017 Yisheng Xie <xieyisheng1@huawei.com>

mm/hotplug: enable memory hotplug for non-lru movable pages

We had considered all of the non-lru pages as unmovable before commit
bda807d44454 ("mm: migrate: support non-lru movable page migration").
But now some of non-lru pages like zsmalloc, virtio-balloon pages also
become movable. So we can offline such blocks by using non-lru page
migration.

This patch straightforwardly adds non-lru migration code, which means
adding non-lru related code to the functions which scan over pfn and
collect pages to be migrated and isolate them before migration.

Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd233f53 24-Feb-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: use static global work_struct for draining per-cpu pages

As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
work_struct to co-ordinate the draining of per-cpu pages on the
workqueue. Only one task can drain at a time but this is better than
the previous scheme that allowed multiple tasks to send IPIs at a time.

One consideration is whether parallel requests should synchronise
against each other. This patch does not synchronise for a global drain
as the common case for such callers is expected to be multiple parallel
direct reclaimers competing for pages when the watermark is close to
min. Draining the per-cpu list is unlikely to make much progress and
serialising the drain is of dubious merit. Drains are synchonrised for
callers such as memory hotplug and CMA that care about the drain being
complete when the function returns.

Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51047820 24-Feb-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: don't check cpuset allowed twice in fast-path

Since commit 682a3385e773 ("mm, page_alloc: inline the fast path of the
zonelist iterator") we replace a NULL nodemask with
cpuset_current_mems_allowed in the fast path, so that
get_page_from_freelist() filters nodes allowed by the cpuset via
for_next_zone_zonelist_nodemask().

In that case it's pointless to additionaly check __cpuset_zone_allowed()
in each iteration, which we can avoid by not adding ALLOC_CPUSET to
alloc_flags in that scenario.

This saves some cycles in the allocator fast path on systems with one or
more non-root cpuset configured. In the slow path, ALLOC_CPUSET is
reset according to __alloc_pages_slowpath(). Without configured
cpusets, this code is disabled by a static key.

Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df76cee6 24-Feb-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: remove redundant checks from alloc fastpath

The allocation fast path contains two similar checks for zoneref->zone
being NULL, where zoneref points either to the first zone in the
zonelist, or to the preferred zone. These can be NULL either due to
empty zonelist, or no zone being compatible with given nodemask or
task's cpuset.

These checks are unnecessary, because the zonelist walks in
first_zones_zonelist() and get_page_from_freelist() handle a NULL
starting zoneref->zone or preferred_zoneref->zone safely. It's safe to
fallback to __alloc_pages_slowpath() where we also have the check early
enough.

Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 374ad05a 24-Feb-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: only use per-cpu allocator for irq-safe requests

Many workloads that allocate pages are not handling an interrupt at a
time. As allocation requests may be from IRQ context, it's necessary to
disable/enable IRQs for every page allocation. This cost is the bulk of
the free path but also a significant percentage of the allocation path.

This patch alters the locking and checks such that only irq-safe
allocation requests use the per-cpu allocator. All others acquire the
irq-safe zone->lock and allocate from the buddy allocator. It relies on
disabling preemption to safely access the per-cpu structures. It could
be slightly modified to avoid soft IRQs using it but it's not clear it's
worthwhile.

This modification may slow allocations from IRQ context slightly but the
main gain from the per-cpu allocator is that it scales better for
allocations from multiple contexts. There is an implicit assumption
that intensive allocations from IRQ contexts on multiple CPUs from a
single NUMA node are rare and that the fast majority of scaling issues
are encountered in !IRQ contexts such as page faulting. It's worth
noting that this patch is not required for a bulk page allocator but it
significantly reduces the overhead.

The following is results from a page allocator micro-benchmark. Only
order-0 is interesting as higher orders do not use the per-cpu allocator

4.10.0-rc2 4.10.0-rc2
vanilla irqsafe-v1r5
Amean alloc-odr0-1 287.15 ( 0.00%) 219.00 ( 23.73%)
Amean alloc-odr0-2 221.23 ( 0.00%) 183.23 ( 17.18%)
Amean alloc-odr0-4 187.00 ( 0.00%) 151.38 ( 19.05%)
Amean alloc-odr0-8 167.54 ( 0.00%) 132.77 ( 20.75%)
Amean alloc-odr0-16 156.00 ( 0.00%) 123.00 ( 21.15%)
Amean alloc-odr0-32 149.00 ( 0.00%) 118.31 ( 20.60%)
Amean alloc-odr0-64 138.77 ( 0.00%) 116.00 ( 16.41%)
Amean alloc-odr0-128 145.00 ( 0.00%) 118.00 ( 18.62%)
Amean alloc-odr0-256 136.15 ( 0.00%) 125.00 ( 8.19%)
Amean alloc-odr0-512 147.92 ( 0.00%) 121.77 ( 17.68%)
Amean alloc-odr0-1024 147.23 ( 0.00%) 126.15 ( 14.32%)
Amean alloc-odr0-2048 155.15 ( 0.00%) 129.92 ( 16.26%)
Amean alloc-odr0-4096 164.00 ( 0.00%) 136.77 ( 16.60%)
Amean alloc-odr0-8192 166.92 ( 0.00%) 138.08 ( 17.28%)
Amean alloc-odr0-16384 159.00 ( 0.00%) 138.00 ( 13.21%)
Amean free-odr0-1 165.00 ( 0.00%) 89.00 ( 46.06%)
Amean free-odr0-2 113.00 ( 0.00%) 63.00 ( 44.25%)
Amean free-odr0-4 99.00 ( 0.00%) 54.00 ( 45.45%)
Amean free-odr0-8 88.00 ( 0.00%) 47.38 ( 46.15%)
Amean free-odr0-16 83.00 ( 0.00%) 46.00 ( 44.58%)
Amean free-odr0-32 80.00 ( 0.00%) 44.38 ( 44.52%)
Amean free-odr0-64 72.62 ( 0.00%) 43.00 ( 40.78%)
Amean free-odr0-128 78.00 ( 0.00%) 42.00 ( 46.15%)
Amean free-odr0-256 80.46 ( 0.00%) 57.00 ( 29.16%)
Amean free-odr0-512 96.38 ( 0.00%) 64.69 ( 32.88%)
Amean free-odr0-1024 107.31 ( 0.00%) 72.54 ( 32.40%)
Amean free-odr0-2048 108.92 ( 0.00%) 78.08 ( 28.32%)
Amean free-odr0-4096 113.38 ( 0.00%) 82.23 ( 27.48%)
Amean free-odr0-8192 112.08 ( 0.00%) 82.85 ( 26.08%)
Amean free-odr0-16384 110.38 ( 0.00%) 81.92 ( 25.78%)
Amean total-odr0-1 452.15 ( 0.00%) 308.00 ( 31.88%)
Amean total-odr0-2 334.23 ( 0.00%) 246.23 ( 26.33%)
Amean total-odr0-4 286.00 ( 0.00%) 205.38 ( 28.19%)
Amean total-odr0-8 255.54 ( 0.00%) 180.15 ( 29.50%)
Amean total-odr0-16 239.00 ( 0.00%) 169.00 ( 29.29%)
Amean total-odr0-32 229.00 ( 0.00%) 162.69 ( 28.96%)
Amean total-odr0-64 211.38 ( 0.00%) 159.00 ( 24.78%)
Amean total-odr0-128 223.00 ( 0.00%) 160.00 ( 28.25%)
Amean total-odr0-256 216.62 ( 0.00%) 182.00 ( 15.98%)
Amean total-odr0-512 244.31 ( 0.00%) 186.46 ( 23.68%)
Amean total-odr0-1024 254.54 ( 0.00%) 198.69 ( 21.94%)
Amean total-odr0-2048 264.08 ( 0.00%) 208.00 ( 21.24%)
Amean total-odr0-4096 277.38 ( 0.00%) 219.00 ( 21.05%)
Amean total-odr0-8192 279.00 ( 0.00%) 220.92 ( 20.82%)
Amean total-odr0-16384 269.38 ( 0.00%) 219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages
in batches of 1 page up to 16384 pages. Avoiding disabling/enabling
overhead massively reduces overhead. Alloc overhead is roughly reduced
by 14-20% in most cases. The free path is reduced by 26-46% and the
total reduction is significant.

Many users require zeroing of pages from the page allocator which is the
vast cost of allocation. Hence, the impact on a basic page faulting
benchmark is not that significant

4.10.0-rc2 4.10.0-rc2
vanilla irqsafe-v1r5
Hmean page_test 656632.98 ( 0.00%) 675536.13 ( 2.88%)
Hmean brk_test 3845502.67 ( 0.00%) 3867186.94 ( 0.56%)
Stddev page_test 10543.29 ( 0.00%) 4104.07 ( 61.07%)
Stddev brk_test 33472.36 ( 0.00%) 15538.39 ( 53.58%)
CoeffVar page_test 1.61 ( 0.00%) 0.61 ( 62.15%)
CoeffVar brk_test 0.87 ( 0.00%) 0.40 ( 53.84%)
Max page_test 666513.33 ( 0.00%) 678640.00 ( 1.82%)
Max brk_test 3882800.00 ( 0.00%) 3887008.66 ( 0.11%)

This is from aim9 and the most notable outcome is that fault variability
is reduced by the patch. The headline improvement is small as the
overall fault cost, zeroing, page table insertion etc dominate relative
to disabling/enabling IRQs in the per-cpu allocator.

Similarly, little benefit was seen on networking benchmarks both
localhost and between physical server/clients where other costs
dominate. It's possible that this will only be noticable on very high
speed networks.

Jesper Dangaard Brouer independently tested this with a separate
microbenchmark from
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Micro-benchmarked with [1] page_bench02:
modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
rmmod page_bench02 ; dmesg --notime | tail -n 4

Compared to baseline: 213 cycles(tsc) 53.417 ns
- against this : 184 cycles(tsc) 46.056 ns
- Saving : -29 cycles
- Very close to expected 27 cycles saving [see below [2]]

Micro benchmarking via time_bench_sample[3], we get the cost of these
operations:

time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.232 ns (step:0)
time_bench: Type:spin_lock_unlock Per elem: 33 cycles(tsc) 8.334 ns (step:0)
time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
time_bench: Type:irqsave_before_lock Per elem: 57 cycles(tsc) 14.344 ns (step:0)
time_bench: Type:spin_lock_unlock_irq Per elem: 34 cycles(tsc) 8.560 ns (step:0)
time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
time_bench: Type:local_BH_disable_enable Per elem: 19 cycles(tsc) 4.920 ns (step:0)
time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
time_bench: Type:local_irq_save_restore Per elem: 38 cycles(tsc) 9.665 ns (step:0)
[Mel's patch removes a ^^^^^^^^^^^^^^^^] ^^^^^^^^^ expected saving - preempt cost
time_bench: Type:preempt_disable_enable Per elem: 11 cycles(tsc) 2.794 ns (step:0)
[adds a preempt ^^^^^^^^^^^^^^^^^^^^^^] ^^^^^^^^^ adds this cost
time_bench: Type:funcion_call_cost Per elem: 6 cycles(tsc) 1.689 ns (step:0)
time_bench: Type:func_ptr_call_cost Per elem: 11 cycles(tsc) 2.767 ns (step:0)
time_bench: Type:page_alloc_put Per elem: 211 cycles(tsc) 52.803 ns (step:0)

Thus, expected improvement is: 38-11 = 27 cycles.

[mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a459eeb7 24-Feb-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: do not depend on cpu hotplug locks inside the allocator

Dmitry has reported the following lockdep splat
lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
__mutex_lock_common kernel/locking/mutex.c:521 [inline]
mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
__alloc_percpu+0x24/0x30 mm/percpu.c:1075
smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
_cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
cpu_up+0x18/0x20 kernel/cpu.c:1095
smp_init+0xe9/0xee kernel/smp.c:564
kernel_init_freeable+0x439/0x690 init/main.c:1010
kernel_init+0x13/0x180 init/main.c:941
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock
pcpu_alloc
pcpu_alloc_mutex

get_online_cpus+0x62/0x90 kernel/cpu.c:248
drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
__alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
__alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
__alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
__alloc_pages include/linux/gfp.h:426 [inline]
__alloc_pages_node include/linux/gfp.h:439 [inline]
alloc_pages_node include/linux/gfp.h:453 [inline]
pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
__alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
map_create kernel/bpf/syscall.c:188 [inline]
SYSC_bpf kernel/bpf/syscall.c:870 [inline]
SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
entry_SYSCALL_64_fastpath+0x1f/0xc2

pcpu_alloc
pcpu_alloc_mutex
drain_all_pages
get_online_cpus
cpu_hotplug.lock

cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
_cpu_up+0xca/0x2a0 kernel/cpu.c:1011
do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
cpu_up+0x18/0x20 kernel/cpu.c:1095
smp_init+0xe9/0xee kernel/smp.c:564
kernel_init_freeable+0x439/0x690 init/main.c:1010
kernel_init+0x13/0x180 init/main.c:941
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock

Pulling cpu hotplug locks inside the page allocator is just too
dangerous. Let's remove the dependency by dropping get_online_cpus()
from drain_all_pages. This is not so simple though because now we do
not have a protection against cpu hotplug which means 2 things:

- the work item might be executed on a different cpu in worker from
unbound pool so it doesn't run on pinned on the cpu

- we have to make sure that we do not race with page_alloc_cpu_dead
calling drain_pages_zone

Disabling preemption in drain_local_pages_wq will solve the first
problem drain_local_pages will determine its local CPU from the WQ
context which will be stable after that point, page_alloc_cpu_dead is
pinned to the CPU already. The later condition is achieved by disabling
IRQs in drain_pages_zone.

Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ccce3b9 24-Feb-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: drain per-cpu pages from workqueue context

The per-cpu page allocator can be drained immediately via
drain_all_pages() which sends IPIs to every CPU. In the next patch, the
per-cpu allocator will only be used for interrupt-safe allocations which
prevents draining it from IPI context. This patch uses workqueues to
drain the per-cpu lists instead.

This is slower but no slowdown during intensive reclaim was measured and
the paths that use drain_all_pages() are not that sensitive to
performance. This is particularly true as the path would only be
triggered when reclaim is failing. It also makes a some sense to avoid
storming a machine with IPIs when it's under memory pressure. Arguably,
it should be further adjusted so that only one caller at a time is
draining pages but it's beyond the scope of the current patch.

Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9cd75558 24-Feb-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: split alloc_pages_nodemask()

alloc_pages_nodemask does a number of preperation steps that determine
what zones can be used for the allocation depending on a variety of
factors. This is fine but a hypothetical caller that wanted multiple
order-0 pages has to do the preparation steps multiple times. This
patch structures __alloc_pages_nodemask such that it's relatively easy
to build a bulk order-0 page allocator. There is no functional change.

Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 066b2393 24-Feb-2017 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: split buffered_rmqueue()

Patch series "Use per-cpu allocator for !irq requests and prepare for a
bulk allocator", v5.

This series is motivated by a conversation led by Jesper Dangaard Brouer
at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
Part of his motivation was due to the overhead of allocating multiple
order-0 that led some drivers to use high-order allocations and
splitting them. This is very slow in some cases.

The first two patches in this series restructure the page allocator such
that it is relatively easy to introduce an order-0 bulk page allocator.
A patch exists to do that and has been handed over to Jesper until an
in-kernel users is created. The third patch prevents the per-cpu
allocator being drained from IPI context as that can potentially corrupt
the list after patch four is merged. The final patch alters the per-cpu
alloctor to make it exclusive to !irq requests. This cuts
allocation/free overhead by roughly 30%.

Performance tests from both Jesper and me are included in the patch.

This patch (of 4):

buffered_rmqueue removes a page from a given zone and uses the per-cpu
list for order-0. This is fine but a hypothetical caller that wanted
multiple order-0 pages has to disable/reenable interrupts multiple
times. This patch structures buffere_rmqueue such that it's relatively
easy to build a bulk order-0 page allocator. There is no functional
change.

[mgorman@techsingularity.net: failed per-cpu refill may blow up]
Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 685dbf6f 22-Feb-2017 David Rientjes <rientjes@google.com>

mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled

The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets
the allocation nodemask to cpuset_current_mems_allowed when there is no
effective mempolicy. cpuset_current_mems_allowed is only effective when
cpusets are enabled, which is also printed by warn_alloc(), so setting
the nodemask to cpuset_current_mems_allowed is redundant and prevents
debugging issues where ac->nodemask is not set properly in the page
allocator.

This provides better debugging output since
cpuset_print_current_mems_allowed() is already provided.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c18ba7a 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm: help __GFP_NOFAIL allocations which do not trigger OOM killer

Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer
we are left with requests which require to loop inside the allocator
without invoking the oom killer (e.g. GFP_NOFS|__GFP_NOFAIL used by fs
code) and so they might, in very unlikely situations, loop for ever -
e.g. other parallel request could starve them.

This patch tries to limit the likelihood of such a lockup by giving
these __GFP_NOFAIL requests a chance to move on by consuming a small
part of memory reserves. We are using ALLOC_HARDER which should be
enough to prevent from the starvation by regular allocation requests,
yet it shouldn't consume enough from the reserves to disrupt high
priority requests (ALLOC_HIGH).

While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback
which enforces the cpusets but allows to fallback to ignore them if the
first attempt fails. __GFP_NOFAIL requests can be considered important
enough to allow cpuset runaway in order for the system to move on. It
is highly unlikely that any of these will be GFP_USER anyway.

Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06ad276a 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

__alloc_pages_may_oom makes sure to skip the OOM killer depending on the
allocation request. This includes lowmem requests, costly high order
requests and others. For a long time __GFP_NOFAIL acted as an override
for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
the existing open coded loops around allocator to nofail request (and we
have done that in the past) then such a change would have a non trivial
side effect which is far from obvious. Note that the primary motivation
for skipping the OOM killer is to prevent from pre-mature invocation.

The exception has been added by commit 82553a937f12 ("oom: invoke oom
killer for __GFP_NOFAIL"). The changelog points out that the oom killer
has to be invoked otherwise the request would be looping for ever. But
this argument is rather weak because the OOM killer doesn't really
guarantee a forward progress for those exceptional cases:

- it will hardly help to form costly order which in turn can result in
the system panic because of no oom killable task in the end - I believe
we certainly do not want to put the system down just because there is a
nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
the consequences. It is much better this request would loop for ever
than the massive system disruption

- lowmem is also highly unlikely to be freed during OOM killer

- GFP_NOFS request could trigger while there is still a lot of memory
pinned by filesystems.

This patch simply removes the __GFP_NOFAIL special case in order to have a
more clear semantic without surprising side effects.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Nils Holland <nholland@tisys.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a67f648 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm: consolidate GFP_NOFAIL checks in the allocator slowpath

Tetsuo Handa has pointed out that commit 0a0337e0d1d1 ("mm, oom: rework
oom detection") has subtly changed semantic for costly high order
requests with __GFP_NOFAIL and withtout __GFP_REPEAT and those can fail
right now. My code inspection didn't reveal any such users in the tree
but it is true that this might lead to unexpected allocation failures
and subsequent OOPs.

__alloc_pages_slowpath wrt. GFP_NOFAIL is hard to follow currently.
There are few special cases but we are lacking a catch all place to be
sure we will not miss any case where the non failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under nopage label which is the generic go-to-fail path. Non
failing allocations are retried or those that cannot retry like
non-sleeping allocation go to the failure point directly. This should
make the code flow much easier to follow and make it less error prone
for future changes.

While we are there we have to move the stall check up to catch
potentially looping non-failing allocations.

[akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitalized]
Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9af744d7 22-Feb-2017 Michal Hocko <mhocko@suse.com>

lib/show_mem.c: teach show_mem to work with the given nodemask

show_mem() allows to filter out node specific data which is irrelevant
to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is
done in skip_free_areas_node which skips all nodes which are not in the
mems_allowed of the current process. This works most of the time as
expected because the nodemask shouldn't be outside of the allocating
task but there are some exceptions. E.g. memory hotplug might want to
request allocations from outside of the allowed nodes (see
new_node_page).

Get rid of this hardcoded behavior and push the allocation mask down the
show_mem path and use it instead of cpuset_current_mems_allowed. NULL
nodemask is interpreted as cpuset_current_mems_allowed.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8e99259 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: warn_alloc print nodemask

warn_alloc is currently used for to report an allocation failure or an
allocation stall. We print some details of the allocation request like
the gfp mask and the request order. We do not print the allocation
nodemask which is important when debugging the reason for the allocation
failure as well. We alreaddy print the nodemask in the OOM report.

Add nodemask to warn_alloc and print it in warn_alloc as well.

Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c02e50bb 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm, page_alloc: do not report all nodes in show_mem

Patch series "show_mem updates", v2.

This is a mixture of one bug fix (patch 1), an enhancement (patch 2) and
cleanups (the rest of the series). First two patches should be really
straightforward. Patch 3 removes some arch specific show_mem
implementations because I think they are quite outdated and do not
really serve any useful purpose anymore. I think we should really
strive to have a consistent show_mem output regardless of the
architecture. If some architecture is really special and wants to dump
something additional we should do that via an arch specific hook.

The last patch adds nodemask parameter so that we do not rely on the
hardcoded mems_allowed of the current task when doing the node
filtering. I consider this more a cleanup than a fix because basically
all users use a nodemask which is a subset of mems_allowed. There is
only one call path in the memory hotplug which doesn't comply with this
but that is hardly something to worry about.

This patch (of 4):

Commit 599d0c954f91 ("mm, vmscan: move LRU lists to node") has added per
numa node statistics to show_mem but it forgot to add
skip_free_areas_node to filter out nodes which are outside of the
allocating task numa policy. Add this check to not pollute the output
with the pointless information.

Link: http://lkml.kernel.org/r/20170117091543.25850-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b92df1de 22-Feb-2017 Paul Burton <paulburton@kernel.org>

mm: page_alloc: skip over regions of invalid pfns where possible

When using a sparse memory model memmap_init_zone() when invoked with
the MEMMAP_EARLY context will skip over pages which aren't valid - ie.
which aren't in a populated region of the sparse memory map. However if
the memory map is extremely sparse then it can spend a long time
linearly checking each PFN in a large non-populated region of the memory
map & skipping it in turn.

When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
information to quickly discover the next valid PFN given an invalid one
by searching through the list of memory regions & skipping forwards to
the first PFN covered by the memory region to the right of the
non-populated region. Implement this in order to speed up
memmap_init_zone() for systems with extremely sparse memory maps.

James said "I have tested this patch on a virtual model of a Samurai CPU
with a sparse memory map. The kernel boot time drops from 109 to
62 seconds. "

Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Tested-by: James Hartley <james.hartley@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65190cff 22-Feb-2017 Michal Hocko <mhocko@suse.com>

oom, trace: add compaction retry tracepoint

Higher order requests oom debugging is currently quite hard. We do have
some compaction points which can tell us how the compaction is operating
but there is no trace point to tell us about compaction retry logic.
This patch adds a one which will have the following format

bash-3126 [001] .... 1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0

we can see that the order 9 request is not retried even though we are in
the highest compaction priority mode becase the last compaction attempt
was withdrawn. This means that compaction_zonelist_suitable must have
returned false and there is no suitable zone to compact for this request
and so no need to retry further.

another example would be
<...>-3137 [001] .... 81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0

in this case the order-9 compaction failed to find any suitable block.
We do not retry anymore because this is a costly request and those do
not go below COMPACT_PRIO_SYNC_LIGHT priority.

Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d379f01d 22-Feb-2017 Michal Hocko <mhocko@suse.com>

oom, trace: add oom detection tracepoints

should_reclaim_retry is the central decision point for declaring the
OOM. It might be really useful to expose data used for this decision
making when debugging an unexpected oom situations.

Say we have an OOM report:
[ 52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G W 4.8.0-oomtrace3-00006-gb21338b386d2 #1024

Now we can check the tracepoint data to see how we have ended up in this
situation:
mem_eater-3148 [003] .... 52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
mem_eater-3148 [003] .... 52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
mem_eater-3148 [003] .... 52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
mem_eater-3148 [003] .... 52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
mem_eater-3148 [003] .... 52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
mem_eater-3148 [003] .... 52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
mem_eater-3148 [003] .... 52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
mem_eater-3148 [003] .... 52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
mem_eater-3148 [003] .... 52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0

The above shows that we can quickly deduce that the reclaim stopped
making any progress (see no_progress_loops increased in each round) and
while there were still some 51 reclaimable pages they couldn't be
dropped for some reason (vmscan trace points would tell us more about
that part). available will represent reclaimable + free_pages scaled
down per no_progress_loops factor. This is essentially an optimistic
estimate of how much memory we would have when reclaiming everything.
This can be compared to min_wmark to get a rought idea but the
wmark_check tells the result of the watermark check which is more
precise (includes lowmem reserves, considers the order etc.). As we can
see no zone is eligible in the end and that is why we have triggered the
oom in this situation.

Please note that higher order requests might fail on the wmark_check
even when there is much more memory available than min_wmark - e.g.
when the memory is fragmented. A follow up tracepoint will help to
debug those situations.

Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 13ad59df 22-Feb-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: avoid page_to_pfn() when merging buddies

On architectures that allow memory holes, page_is_buddy() has to perform
page_to_pfn() to check for the memory hole. After the previous patch,
we have the pfn already available in __free_one_page(), which is the
only caller of page_is_buddy(), so move the check there and avoid
page_to_pfn().

Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76741e77 22-Feb-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: don't convert pfn to idx when merging

In __free_one_page() we do the buddy merging arithmetics on "page/buddy
index", which is just the lower MAX_ORDER bits of pfn. The operations
we do that affect the higher bits are bitwise AND and subtraction (in
that order), where the final result will be the same with the higher
bits left unmasked, as long as these bits are equal for both buddies -
which must be true by the definition of a buddy.

We can therefore use pfn's directly instead of "index" and skip the
zeroing of >MAX_ORDER bits. This can help a bit by itself, although
compiler might be smart enough already. It also helps the next patch to
avoid page_to_pfn() for memory hole checks.

Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa187507 22-Feb-2017 Michal Hocko <mhocko@suse.com>

mm: throttle show_mem() from warn_alloc()

Tetsuo has been stressing OOM killer path with many parallel allocation
requests when he has noticed that it is not all that hard to swamp
kernel logs with warn_alloc messages caused by allocation stalls. Even
though the allocation stall message is triggered only once in 10s there
might be many different tasks hitting it roughly around the same time.

A big part of the output is show_mem() which can generate a lot of
output even on a small machines. There is no reason to show the state
of memory counter for each allocation stall, especially when multiple of
them are reported in a short time period. Chances are that not much has
changed since the last report. This patch simply rate limits show_mem
called from warn_alloc to only dump something once per second. This
should be enough to give us a clue why an allocation might be stalling
while burst of warnings will not swamp log with too much data.

While we are at it, extract all the show_mem related handling (filters)
into a separate function warn_alloc_show_mem. This will make the code
cleaner and as a bonus point we can distinguish which part of warn_alloc
got throttled due to rate limiting as ___ratelimit dumps the caller.

[akpm@linux-foundation.org: reduce scope of the ratelimit_states]
Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e47483bc 24-Jan-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fix premature OOM when racing with cpuset mems update

Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers OOM killer in few seconds, despite lots of free memory. The
test attempts to repeatedly fault in memory in one process in a cpuset,
while changing allowed nodes of the cpuset between 0 and 1 in another
process.

The problem comes from insufficient protection against cpuset changes,
which can cause get_page_from_freelist() to consider all zones as
non-eligible due to nodemask and/or current->mems_allowed. This was
masked in the past by sufficient retries, but since commit 682a3385e773
("mm, page_alloc: inline the fast path of the zonelist iterator") we fix
the preferred_zoneref once, and don't iterate over the whole zonelist in
further attempts, thus the only eligible zones might be placed in the
zonelist before our starting point and we always miss them.

A previous patch fixed this problem for current->mems_allowed. However,
cpuset changes also update the task's mempolicy nodemask. The fix has
two parts. We have to repeat the preferred_zoneref search when we
detect cpuset update by way of seqcount, and we have to check the
seqcount before considering OOM.

[akpm@linux-foundation.org: fix typo in comment]
Link: http://lkml.kernel.org/r/20170120103843.24587-5-vbabka@suse.cz
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ce9bfef 24-Jan-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: move cpuset seqcount checking to slowpath

This is a preparation for the following patch to make review simpler.
While the primary motivation is a bug fix, this also simplifies the fast
path, although the moved code is only enabled when cpusets are in use.

Link: http://lkml.kernel.org/r/20170120103843.24587-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16096c25 24-Jan-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fix fast-path race with cpuset update or removal

Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers OOM killer in few seconds, despite lots of free memory. The
test attempts to repeatedly fault in memory in one process in a cpuset,
while changing allowed nodes of the cpuset between 0 and 1 in another
process.

One possible cause is that in the fast path we find the preferred
zoneref according to current mems_allowed, so that it points to the
middle of the zonelist, skipping e.g. zones of node 1 completely. If
the mems_allowed is updated to contain only node 1, we never reach it in
the zonelist, and trigger OOM before checking the cpuset_mems_cookie.

This patch fixes the particular case by redoing the preferred zoneref
search if we switch back to the original nodemask. The condition is
also slightly changed so that when the last non-root cpuset is removed,
we don't miss it.

Note that this is not a full fix, and more patches will follow.

Link: http://lkml.kernel.org/r/20170120103843.24587-3-vbabka@suse.cz
Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea57485a 24-Jan-2017 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: fix check for NULL preferred_zone

Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

This is v2 of my attempt to fix the recent report based on LTP cpuset
stress test [1]. The intention is to go to stable 4.9 LTSS with this,
as triggering repeated OOMs is not nice. That's why the patches try to
be not too intrusive.

Unfortunately why investigating I found that modifying the testcase to
use per-VMA policies instead of per-task policies will bring the OOM's
back, but that seems to be much older and harder to fix problem. I have
posted a RFC [2] but I believe that fixing the recent regressions has a
higher priority.

Longer-term we might try to think how to fix the cpuset mess in a better
and less error prone way. I was for example very surprised to learn,
that cpuset updates change not only task->mems_allowed, but also
nodemask of mempolicies. Until now I expected the parameter to
alloc_pages_nodemask() to be stable. I wonder why do we then treat
cpusets specially in get_page_from_freelist() and distinguish HARDWALL
etc, when there's unconditional intersection between mempolicy and
cpuset. I would expect the nodemask adjustment for saving overhead in
g_p_f(), but that clearly doesn't happen in the current form. So we
have both crazy complexity and overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

This patch (of 4):

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
which can theoretically happen due to concurrent cpuset modification. We
check the zoneref pointer which is never NULL and we should check the zone
pointer. Also document this in first_zones_zonelist() comment per Michal
Hocko.

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 424f6c48 24-Jan-2017 Lucas Stach <l.stach@pengutronix.de>

mm: alloc_contig: re-allow CMA to compact FS pages

Commit 73e64c51afc5 ("mm, compaction: allow compaction for GFP_NOFS
requests") changed compation to skip FS pages if not explicitly allowed
to touch them, but missed to update the CMA compact_control.

This leads to a very high isolation failure rate, crippling performance
of CMA even on a lightly loaded system. Re-allow CMA to compact FS
pages by setting the correct GFP flags, restoring CMA behavior and
performance to the kernel 4.9 level.

Fixes: 73e64c51afc5 (mm, compaction: allow compaction for GFP_NOFS requests)
Link: http://lkml.kernel.org/r/20170113115155.24335-1-l.stach@pengutronix.de
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2976db80 10-Jan-2017 Alexander Duyck <alexander.h.duyck@intel.com>

mm: rename __page_frag functions to __page_frag_cache, drop order from drain

This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or
refilling the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c2dd3e4 10-Jan-2017 Alexander Duyck <alexander.h.duyck@intel.com>

mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free

Patch series "Page fragment updates", v4.

This patch series takes care of a few cleanups for the page fragments
API.

First we do some renames so that things are much more consistent. First
we move the page_frag_ portion of the name to the front of the functions
names. Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.

Finally I added a bit of documentation that will hopefully help to
explain some of this. I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.

This patch (of 3):

This patch renames the page frag functions to be more consistent with
other APIs. Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix. This makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.

Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f073bdc5 10-Jan-2017 Ard Biesheuvel <ardb@kernel.org>

mm: don't dereference struct page fields of invalid pages

The VM_BUG_ON() check in move_freepages() checks whether the node id of
a page matches the node id of its zone. However, it does this before
having checked whether the struct page pointer refers to a valid struct
page to begin with. This is guaranteed in most cases, but may not be
the case if CONFIG_HOLES_IN_ZONE=y.

So reorder the VM_BUG_ON() with the pfn_valid_within() check.

Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Robert Richter <rrichter@cavium.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41b6167e 10-Jan-2017 Michal Hocko <mhocko@suse.com>

mm: get rid of __GFP_OTHER_NODE

The flag was introduced by commit 78afd5612deb ("mm: add
__GFP_OTHER_NODE flag") to allow proper accounting of remote node
allocations done by kernel daemons on behalf of a process - e.g.
khugepaged.

After "mm: fix remote numa hits statistics" we do not need and actually
use the flag so we can safely remove it because all allocations which
are satisfied from their "home" node are accounted properly.

[mhocko@suse.com: fix build]
Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2df26639 10-Jan-2017 Michal Hocko <mhocko@suse.com>

mm: fix remote numa hits statistics

Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
branches in zone_statistics") has an unintentional side effect that
remote node allocation requests are accounted as NUMA_MISS rathat than
NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.

There are many of these potentially because the flag is used very rarely
while we have many users of __alloc_pages_node.

Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
follow up patch) and treat all allocations that were satisfied from the
preferred zone's node as NUMA_HITS because this is the same node we
requested the allocation from in most cases. If this is not the local
node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

One downsize would be that an allocation request for a node which is
outside of the mempolicy nodemask would be reported as a hit which is a
bit weird but that was the case before b9f00e147f27 already.

Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Jia He <hejianet@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz> # with cbmc[1] superpowers
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44fdffd7 14-Dec-2016 Alexander Duyck <alexander.h.duyck@intel.com>

mm: add support for releasing multiple instances of a page

Add a function that allows us to batch free a page that has multiple
references outstanding. Specifically this function can be used to drop
a page being used in the page frag alloc cache. With this drivers can
make use of functionality similar to the page frag alloc cache without
having to do any workarounds for the fact that there is no function that
frees multiple references.

Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Cc: Helge Deller <deller@gmx.de>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Keguang Zhang <keguang.zhang@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6de734b 12-Dec-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: keep pcp count and list contents in sync if struct page is corrupted

Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
defer debugging checks of pages allocated from the PCP") will allow the
per-cpu list counter to be out of sync with the per-cpu list contents if
a struct page is corrupted.

The consequence is an infinite loop if the per-cpu lists get fully
drained by free_pcppages_bulk because all the lists are empty but the
count is positive. The infinite loop occurs here

do {
batch_free++;
if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
list = &pcp->lists[migratetype];
} while (list_empty(list));

What the user sees is a bad page warning followed by a soft lockup with
interrupts disabled in free_pcppages_bulk().

This patch keeps the accounting in sync.

Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29fac03b 12-Dec-2016 Minchan Kim <minchan@kernel.org>

mm: make unreserve highatomic functions reliable

Currently, unreserve_highatomic_pageblock bails out if it found
highatomic pageblock regardless of really moving free pages from the one
so that it could mitigate unreserve logic's goal which saves OOM of a
process.

This patch makes unreserve functions bail out only if it moves some
pages out of !highatomic free list to avoid such false positive.

Another potential problem is that by race between page freeing and
reserve highatomic function, pages could be in highatomic free list even
though the pageblock is !high atomic migratetype. In that case,
unreserve_highatomic_pageblock can be void if count of highatomic
reserve is less than pageblock_nr_pages. We could solve it simply via
draining all of reserved pages before the OOM. It would have a
safeguard role to exhuast reserved pages before converging to OOM.

Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04c8716f 12-Dec-2016 Minchan Kim <minchan@kernel.org>

mm: try to exhaust highatomic reserve before the OOM

I got OOM report from production team with v4.4 kernel. It had enough
free memory but failed to allocate GFP_KERNEL order-0 page and finally
encountered OOM kill. It occured during QA process which launches
several apps, switching and so on. It happned rarely. IOW, In normal
situation, it was not a problem but if we are unluck so that several
apps uses peak memory at the same time, it can happen. If we manage to
pass the phase, the system can go working well.

I could reproduce it with my test(memory spike easily. Look at below.

The reason is free pages(19M) of DMA32 zone are reserved for
HIGHORDERATOMIC and doesn't unreserved before the OOM.

balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
dump_header+0x5c/0x1ce
oom_kill_process+0x22e/0x400
out_of_memory+0x1ac/0x210
__alloc_pages_nodemask+0x101e/0x1040
handle_mm_fault+0xa0a/0xbf0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
active_file:15 inactive_file:44 isolated_file:0
unevictable:0 dirty:0 writeback:24 unstable:0
slab_reclaimable:2483 slab_unreclaimable:3326
mapped:0 shmem:0 pagetables:1906 bounce:0
free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned

Another example exceeded the limit by the race is

in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
warn_alloc_failed+0xdb/0x130
__alloc_pages_nodemask+0x4d6/0xdb0
new_slab+0x339/0x490
___slab_alloc.constprop.74+0x367/0x480
__slab_alloc.constprop.73+0x20/0x40
__kmalloc+0x1a4/0x1e0
alloc_indirect.isra.14+0x1d/0x50
virtqueue_add_sgs+0x1c4/0x470
__virtblk_add_req+0xae/0x1f0
virtio_queue_rq+0x12d/0x290
__blk_mq_run_hw_queue+0x239/0x370
blk_mq_run_hw_queue+0x8f/0xb0
blk_mq_insert_requests+0x18c/0x1a0
blk_mq_flush_plug_list+0x125/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
__do_page_cache_readahead+0x196/0x230
filemap_fault+0x448/0x4f0
ext4_filemap_fault+0x36/0x50
__do_fault+0x75/0x140
handle_mm_fault+0x84d/0xbe0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:363826 inactive_anon:121283 isolated_anon:32
active_file:65 inactive_file:152 isolated_file:0
unevictable:0 dirty:0 writeback:46 unstable:0
slab_reclaimable:2778 slab_unreclaimable:3070
mapped:112 shmem:0 pagetables:1822 bounce:0
free:9469 free_pcp:231 free_cma:0
Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
2775 total pagecache pages
2536 pages in swap cache
Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
Free swap = 108744kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12648 pages reserved
0 pages cma reserved
0 pages hwpoisoned

It's weird to show that zone has enough free memory above min watermark
but OOMed with 4K GFP_KERNEL allocation due to reserved highatomic
pages. As last resort, try to unreserve highatomic pages again and if
it has moved pages to non-highatmoc free list, retry reclaim once more.

Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4855e4a7 12-Dec-2016 Minchan Kim <minchan@kernel.org>

mm: prevent double decrease of nr_reserved_highatomic

There is race between page freeing and unreserved highatomic.

CPU 0 CPU 1

free_hot_cold_page
mt = get_pfnblock_migratetype
set_pcppage_migratetype(page, mt)
unreserve_highatomic_pageblock
spin_lock_irqsave(&zone->lock)
move_freepages_block
set_pageblock_migratetype(page)
spin_unlock_irqrestore(&zone->lock)
free_pcppages_bulk
__free_one_page(mt) <- mt is stale

By above race, a page on CPU 0 could go non-highorderatomic free list
since the pageblock's type is changed. By that, unreserve logic of
highorderatomic can decrease reserved count on a same pageblock severak
times and then it will make mismatch between nr_reserved_highatomic and
the number of reserved pageblock.

So, this patch verifies whether the pageblock is highatomic or not and
decrease the count only if the pageblock is highatomic.

Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88ed365e 12-Dec-2016 Minchan Kim <minchan@kernel.org>

mm: don't steal highatomic pageblock

Patch series "use up highorder free pages before OOM", v3.

I got OOM report from production team with v4.4 kernel. It had enough
free memory but failed to allocate GFP_KERNEL order-0 page and finally
encountered OOM kill. It occured during QA process which launches
several apps, switching and so on. It happned rarely. IOW, In normal
situation, it was not a problem but if we are unluck so that several
apps uses peak memory at the same time, it can happen. If we manage to
pass the phase, the system can go working well.

I could reproduce it with my test(memory spike easily. Look at below.

The reason is free pages(19M) of DMA32 zone are reserved for
HIGHORDERATOMIC and doesn't unreserved before the OOM.

balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
dump_header+0x5c/0x1ce
oom_kill_process+0x22e/0x400
out_of_memory+0x1ac/0x210
__alloc_pages_nodemask+0x101e/0x1040
handle_mm_fault+0xa0a/0xbf0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
active_file:15 inactive_file:44 isolated_file:0
unevictable:0 dirty:0 writeback:24 unstable:0
slab_reclaimable:2483 slab_unreclaimable:3326
mapped:0 shmem:0 pagetables:1906 bounce:0
free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned

Another example exceeded the limit by the race is

in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
warn_alloc_failed+0xdb/0x130
__alloc_pages_nodemask+0x4d6/0xdb0
new_slab+0x339/0x490
___slab_alloc.constprop.74+0x367/0x480
__slab_alloc.constprop.73+0x20/0x40
__kmalloc+0x1a4/0x1e0
alloc_indirect.isra.14+0x1d/0x50
virtqueue_add_sgs+0x1c4/0x470
__virtblk_add_req+0xae/0x1f0
virtio_queue_rq+0x12d/0x290
__blk_mq_run_hw_queue+0x239/0x370
blk_mq_run_hw_queue+0x8f/0xb0
blk_mq_insert_requests+0x18c/0x1a0
blk_mq_flush_plug_list+0x125/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
__do_page_cache_readahead+0x196/0x230
filemap_fault+0x448/0x4f0
ext4_filemap_fault+0x36/0x50
__do_fault+0x75/0x140
handle_mm_fault+0x84d/0xbe0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:363826 inactive_anon:121283 isolated_anon:32
active_file:65 inactive_file:152 isolated_file:0
unevictable:0 dirty:0 writeback:46 unstable:0
slab_reclaimable:2778 slab_unreclaimable:3070
mapped:112 shmem:0 pagetables:1822 bounce:0
free:9469 free_pcp:231 free_cma:0
Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
2775 total pagecache pages
2536 pages in swap cache
Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
Free swap = 108744kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12648 pages reserved
0 pages cma reserved
0 pages hwpoisoned

During the investigation, I found some problems with highatomic so this
patch aims to solve the problems and the final goal is to unreserve
every highatomic free pages before the OOM kill.

This patch (of 4):

In page freeing path, migratetype is racy so that a highorderatomic page
could free into non-highorderatomic free list. If that page is
allocated, VM can change the pageblock from higorderatomic to something.
In that case, highatomic pageblock accounting is broken so it doesn't
work(e.g., VM cannot reserve highorderatomic pageblocks any more
although it doesn't reach 1% limit).

So, this patch prohibits the changing from highatomic to other type.
It's no problem because MIGRATE_HIGHATOMIC is not listed in fallback
array so stealing will only happen due to unexpected races which is
really rare. Also, such prohibiting keeps highatomic pageblock more
longer so it would be better for highorderatomic page allocation.

Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9e80c719 10-Nov-2016 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm: remove extra newline from allocation stall warning

Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
long") by error embedded "\n" in the format string, resulting in strange
output.

[ 722.876655] kworker/0:1: page alloction stalls for 160001ms, order:0
[ 722.876656] , mode:0x2400000(GFP_NOIO)
[ 722.876657] CPU: 0 PID: 6966 Comm: kworker/0:1 Not tainted 4.8.0+ #69

Link: http://lkml.kernel.org/r/1476026219-7974-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 005fd4bb 03-Nov-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm/page_alloc: Convert to hotplug state machine

Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-7-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 58bea414 18-Oct-2016 Kees Cook <keescook@chromium.org>

latent_entropy: Fix wrong gcc code generation with 64 bit variables

The stack frame size could grow too large when the plugin used long long
on 32-bit architectures when the given function had too many basic blocks.

The gcc warning was:

drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This switches latent_entropy from u64 to unsigned long.

Thanks to PaX Team and Emese Revfy for the patch.

Signed-off-by: Kees Cook <keescook@chromium.org>


# 1f84a18f 27-Oct-2016 Joe Perches <joe@perches.com>

mm: page_alloc: use KERN_CONT where appropriate

Recent changes to printk require KERN_CONT uses to continue logging
messages. So add KERN_CONT where necessary.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
Link: http://lkml.kernel.org/r/c7df37c8665134654a17aaeb8b9f6ace1d6db58b.1476239034.git.joe@perches.com
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9dcb8b68 26-Oct-2016 Linus Torvalds <torvalds@linux-foundation.org>

mm: remove per-zone hashtable of bitlock waitqueues

The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# adb1fe9a 25-Oct-2016 Josh Poimboeuf <jpoimboe@redhat.com>

mm/page_alloc: Remove kernel address exposure in free_reserved_area()

Linus suggested we try to remove some of the low-hanging fruit related
to kernel address exposure in dmesg. The only leaks I see on my local
system are:

Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000)
Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000)
Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000)
Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000)
Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000)

Linus says:

"I suspect we should just remove [the addresses in the 'Freeing'
messages]. I'm sure they are useful in theory, but I suspect they
were more useful back when the whole "free init memory" was
originally done.

These days, if we have a use-after-free, I suspect the init-mem
situation is the easiest situation by far. Compared to all the dynamic
allocations which are much more likely to show it anyway. So having
debug output for that case is likely not all that productive."

With this patch the freeing messages now look like this:

Freeing SMP alternatives memory: 32K
Freeing initrd memory: 10588K
Freeing unused kernel memory: 3592K
Freeing unused kernel memory: 1352K
Freeing unused kernel memory: 632K

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0766f788 20-Jun-2016 Emese Revfy <re.emese@gmail.com>

latent_entropy: Mark functions with __latent_entropy

The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>


# 38addce8 20-Jun-2016 Emese Revfy <re.emese@gmail.com>

gcc-plugins: Add latent_entropy plugin

This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).

At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.

The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).

To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).

Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.

Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>


# 63f53dea 07-Oct-2016 Michal Hocko <mhocko@suse.com>

mm: warn about allocations which stall for too long

Currently we do warn only about allocation failures but small
allocations are basically nofail and they might loop in the page
allocator for a long time. Especially when the reclaim cannot make any
progress - e.g. GFP_NOFS cannot invoke the oom killer and rely on a
different context to make a forward progress in case there is a lot
memory used by filesystems.

Give us at least a clue when something like this happens and warn about
allocations which take more than 10s. Print the basic allocation
context information along with the cumulative time spent in the
allocation as well as the allocation stack. Repeat the warning after
every 10 seconds so that we know that the problem is permanent rather
than ephemeral.

Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7877cdcc 07-Oct-2016 Michal Hocko <mhocko@suse.com>

mm: consolidate warn_alloc_failed users

warn_alloc_failed is currently used from the page and vmalloc
allocators. This is a good reuse of the code except that vmalloc would
appreciate a slightly different warning message. This is already
handled by the fmt parameter except that

"%s: page allocation failure: order:%u, mode:%#x(%pGg)"

is printed anyway. This might be quite misleading because it might be a
vmalloc failure which leads to the warning while the page allocator is
not the culprit here. Fix this by always using the fmt string and only
print the context that makes sense for the particular context (e.g.
order makes only very little sense for the vmalloc context).

Rename the function to not miss any user and also because a later patch
will reuse it also for !failure cases.

Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 423b452e 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: pull no_progress_loops update to should_reclaim_retry()

The should_reclaim_retry() makes decisions based on no_progress_loops,
so it makes sense to also update the counter there. It will be also
consistent with should_compact_retry() and compaction_retries. No
functional change.

[hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2033b00 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: restrict full priority to non-costly orders

The new ultimate compaction priority disables some heuristics, which may
result in excessive cost. This is fine for non-costly orders where we
want to try hard before resulting for OOM, but might be disruptive for
costly orders which do not trigger OOM and should generally have some
fallback. Thus, we disable the full priority for costly orders.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9436498 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: more reliably increase direct compaction priority

During reclaim/compaction loop, compaction priority can be increased by
the should_compact_retry() function, but the current code is not
optimal. Priority is only increased when compaction_failed() is true,
which means that compaction has scanned the whole zone. This may not
happen even after multiple attempts with a lower priority due to
parallel activity, so we might needlessly struggle on the lower
priorities and possibly run out of compaction retry attempts in the
process.

After this patch we are guaranteed at least one attempt at the highest
compaction priority even if we exhaust all retries at the lower
priorities.

Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3250845d 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

Revert "mm, oom: prevent premature OOM killer invocation for high order request"

Patch series "reintroduce compaction feedback for OOM decisions".

After several people reported OOM's for order-2 allocations in 4.7 due
to Michal Hocko's OOM rework, he reverted the part that considered
compaction feedback [1] in the decisions to retry reclaim/compaction.
This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
while mmotm had an almost complete solution that instead improved
compaction reliability.

This series completes the mmotm solution and reintroduces the compaction
feedback into OOM decisions. The first two patches restore the state of
mmotm before the temporary solution was merged, the last patch should be
the missing piece for reliability. The third patch restricts the
hardened compaction to non-costly orders, since costly orders don't
result in OOMs in the first place.

[1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E

This patch (of 4):

Commit 6b4e3181d7bd ("mm, oom: prevent premature OOM killer invocation
for high order request") was intended as a quick fix of OOM regressions
for 4.8 and stable 4.7.x kernels. For a better long-term solution, we
still want to consider compaction feedback, which should be possible
after some more improvements in the following patches.

This reverts commit 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f.

Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f6f34b43 07-Oct-2016 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

mm: introduce arch_reserved_kernel_pages()

Currently arch specific code can reserve memory blocks but
alloc_large_system_hash() may not take it into consideration when sizing
the hashes. This can lead to bigger hash than required and lead to no
available memory for other purposes. This is specifically true for
systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

One approach to solve this problem would be to walk through the memblock
regions and calculate the available memory and base the size of hash
system on the available memory.

The other approach would be to depend on the architecture to provide the
number of pages that are reserved. This change provides hooks to allow
the architecture to provide the required info.

Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9634cf0 07-Oct-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm: use zonelist name instead of using hardcoded index

Use the existing enums instead of hardcoded index when looking at the
zonelist. This makes it more readable. No functionality change by this
patch.

Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 980ac167 07-Oct-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_ext: support extra space allocation by page_ext user

Until now, if some page_ext users want to use it's own field on
page_ext, it should be defined in struct page_ext by hard-coding. It
has a problem that wastes memory in following situation.

struct page_ext {
#ifdef CONFIG_A
int a;
#endif
#ifdef CONFIG_B
int b;
#endif
};

Assume that kernel is built with both CONFIG_A and CONFIG_B. Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int. It's undesirable
result so this patch tries to fix it.

To solve above problem, this patch implements to support extra space
allocation at runtime. When need() callback returns true, it's extra
memory requirement is summed to entry size of page_ext. Also, offset
for each user's extra memory space is returned. With this offset, user
can use this extra space and there is no need to define needed field on
page_ext by hard-coding.

This patch only implements an infrastructure. Following patch will use
it for page_owner which is only user having it's own fields on page_ext.

Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1c1e9f7 07-Oct-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug_pagealloc.c: don't allocate page_ext if we don't use guard page

What debug_pagealloc does is just mapping/unmapping page table.
Basically, it doesn't need additional memory space to memorize
something. But, with guard page feature, it requires additional memory
to distinguish if the page is for guard or not. Guard page is only used
when debug_guardpage_minorder is non-zero so this patch removes
additional memory allocation (page_ext) if debug_guardpage_minorder is
zero.

It saves memory if we just use debug_pagealloc and not guard page.

Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acbc15a4 07-Oct-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug_pagealloc.c: clean-up guard page handling code

Patch series "Reduce memory waste by page extension user".

This patchset tries to reduce memory waste by page extension user.

First case is architecture supported debug_pagealloc. It doesn't
requires additional memory if guard page isn't used. 8 bytes per page
will be saved in this case.

Second case is related to page owner feature. Until now, if page_ext
users want to use it's own fields on page_ext, fields should be defined
in struct page_ext by hard-coding. It has a following problem.

struct page_ext {
#ifdef CONFIG_A
int a;
#endif
#ifdef CONFIG_B
int b;
#endif
};

Assume that kernel is built with both CONFIG_A and CONFIG_B. Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int. It's undesirable
waste so this patch tries to reduce it. By this patchset, we can save
20 bytes per page dedicated for page owner feature in some
configurations.

This patch (of 6):

We can make code clean by moving decision condition for set_page_guard()
into set_page_guard() itself. It will help code readability. There is
no functional change.

Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e780149b 07-Oct-2016 Xishi Qiu <qiuxishi@huawei.com>

mm: fix set pageblock migratetype in deferred struct page init

On x86_64 MAX_ORDER_NR_PAGES is usually 4M, and a pageblock is usually
2M, so we only set one pageblock's migratetype in deferred_free_range()
if pfn is aligned to MAX_ORDER_NR_PAGES. That means it causes
uninitialized migratetype blocks, you can see from "cat
/proc/pagetypeinfo", almost half blocks are Unmovable.

Also we missed freeing the last block in deferred_init_memmap(), it
causes memory leak.

Fixes: ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e506b996 07-Oct-2016 Xishi Qiu <qiuxishi@huawei.com>

mem-hotplug: fix node spanned pages when we have a movable node

Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
option") rewrote the calculation of node spanned pages. But when we
have a movable node, the size of node spanned pages is double added.
That's because we have an empty normal zone, the present pages is zero,
but its spanned pages is not zero.

e.g.
Zone ranges:
DMA [mem 0x0000000000001000-0x0000000000ffffff]
DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Normal [mem 0x0000000100000000-0x0000007c7fffffff]
Movable zone start for each node
Node 1: 0x0000001080000000
Node 2: 0x0000002080000000
Node 3: 0x0000003080000000
Node 4: 0x0000003c80000000
Node 5: 0x0000004c80000000
Node 6: 0x0000005c80000000
Early memory node ranges
node 0: [mem 0x0000000000001000-0x000000000009ffff]
node 0: [mem 0x0000000000100000-0x000000007552afff]
node 0: [mem 0x000000007bd46000-0x000000007bd46fff]
node 0: [mem 0x000000007bdcd000-0x000000007bffffff]
node 0: [mem 0x0000000100000000-0x000000107fffffff]
node 1: [mem 0x0000001080000000-0x000000207fffffff]
node 2: [mem 0x0000002080000000-0x000000307fffffff]
node 3: [mem 0x0000003080000000-0x0000003c7fffffff]
node 4: [mem 0x0000003c80000000-0x0000004c7fffffff]
node 5: [mem 0x0000004c80000000-0x0000005c7fffffff]
node 6: [mem 0x0000005c80000000-0x0000006c7fffffff]
node 7: [mem 0x0000006c80000000-0x0000007c7fffffff]

node1:
Normal, start=0x1080000, present=0x0, spanned=0x1000000
Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000

After this patch, the problem is fixed.

node1:
Normal, start=0x0, present=0x0, spanned=0x0
Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000

Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8348faf9 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: require only min watermarks for non-costly orders

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. Then __isolate_free_page uses low watermark check to decide
if particular free page can be isolated. In the latter case, using low
watermark is needlessly pessimistic, as the free page isolations are
only temporary. For __compaction_suitable() the higher watermark makes
sense for high-order allocations where more freepages increase the
chance of success, and we can typically fail with some order-0 fallback
when the system is struggling to reach that watermark. But for
low-order allocation, forming the page should not be that hard. So
using low watermark here might just prevent compaction from even trying,
and eventually lead to OOM killer even if we are above min watermarks.

So after this patch, we use min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().

[vbabka@suse.cz: clarify __isolate_free_page() comment]
Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 984fdba6 07-Oct-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: use proper alloc_flags in __compaction_suitable()

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. This check uses direct compactor's alloc_flags, but that's
wrong, since these flags are not applicable for freepage isolation.

For example, alloc_flags may indicate access to memory reserves, making
compaction proceed, and then fail watermark check during the isolation.

A similar problem exists for ALLOC_CMA, which may be part of
alloc_flags, but not during freepage isolation. In this case however it
makes sense to use ALLOC_CMA both in __compaction_suitable() and
__isolate_free_page(), since there's actually nothing preventing the
freepage scanner to isolate from CMA pageblocks, with the assumption
that a page that could be migrated once by compaction can be migrated
also later by CMA allocation. Thus we should count pages in CMA
pageblocks when considering compaction suitability and when isolating
freepages.

To sum up, this patch should remove some false positives from
__compaction_suitable(), and allow compaction to proceed when free pages
required for compaction reside in the CMA pageblocks.

Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6aa303de 01-Sep-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator

Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported
that multiple nodes may have no memory managed by the buddy allocator
but still return true for populated_zone().

Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") was reported to cause kswapd to spin at 100% CPU usage when
fadump was enabled. The old code happened to deal with the situation of
a populated node with zero free pages by co-incidence but the current
code tries to reclaim populated zones without realising that is
impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.

Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b4e3181 01-Sep-2016 Michal Hocko <mhocko@suse.com>

mm, oom: prevent premature OOM killer invocation for high order request

There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation request (for the kernel stack)
invoked OOM killer even during basic workloads (light IO or even kernel
compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed
that the current linux-next which includes [1] and [2] helped and OOMs
are not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress and
we are not getting OOM for order-0 pages. We already do that for
CONFING_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Cc: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f95ff90 11-Aug-2016 Mel Gorman <mgorman@techsingularity.net>

proc, meminfo: use correct helpers for calculating LRU sizes in meminfo

meminfo_proc_show() and si_mem_available() are using the wrong helpers
for calculating the size of the LRUs. The user-visible impact is that
there appears to be an abnormally high number of unevictable pages.

Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6423aa81 10-Aug-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc.c: recalculate some of node threshold when on/offline memory

Some of node threshold depends on number of managed pages in the node.
When memory is going on/offline, it can be changed and we need to adjust
them.

Add recalculation to appropriate places and clean-up related functions
for better maintenance.

Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81cbcbc2 10-Aug-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc.c: fix wrong initialization when sysctl_min_unmapped_ratio changes

Before resetting min_unmapped_pages, we need to initialize
min_unmapped_pages rather than min_slab_pages.

Fixes: a5f5f91da6 (mm: convert zone_reclaim to node_reclaim)
Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c4159a75 08-Aug-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: memcontrol: only mark charged pages with PageKmemcg

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

# no memory cgroups has been created yet, create one
mkdir /sys/fs/cgroup/memory/test
# run something allocating pages with __GFP_ACCOUNT, e.g.
# a program using pipe
dmesg | tail
# remove the memory cgroup
rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

BUG: Bad page state in process swapper/0 pfn:1fd945c
page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b4911ea2 04-Aug-2016 Mel Gorman <mgorman@techsingularity.net>

mm: initialise per_cpu_nodestats for all online pgdats at boot

Paul Mackerras and Reza Arbab reported that machines with memoryless
nodes fail when vmstats are refreshed. Paul reported an oops as follows

Unable to handle kernel paging request for data at address 0xff7a10000
Faulting instruction address: 0xc000000000270cd0
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
task: c000000ff0680010 task.stack: c000000ff0704000
NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
MSR: 9000000102009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE,TM[E]> CR: 846b6824 XER: 20000000
CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
NIP refresh_zone_stat_thresholds+0x80/0x240
LR refresh_zone_stat_thresholds+0x98/0x240
Call Trace:
refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

Both supplied potential fixes but one potentially misses checks and
another had redundant initialisations. This version initialises
per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.

Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd721ea7 02-Aug-2016 Fabian Frederick <fabf@skynet.be>

treewide: replace obsolete _refok by __ref

There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3486f53 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: simplify contended compaction handling

Async compaction detects contention either due to failing trylock on
zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
compaction: khugepaged should not give up due to need_resched()") the
code got quite complicated to distinguish these two up to the
__alloc_pages_slowpath() level, so different decisions could be taken
for khugepaged allocations.

After the recent changes, khugepaged allocations don't check for
contended compaction anymore, so we again don't need to distinguish lock
and sched contention, and simplify the current convoluted code a lot.

However, I believe it's also possible to simplify even more and
completely remove the check for contended compaction after the initial
async compaction for costly orders, which was originally aimed at THP
page fault allocations. There are several reasons why this can be done
now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
all, unless the system admin has overridden the default, or application has
indicated via madvise that it can benefit from THP's. In both cases, it
means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
wouldn't, the second compaction attempt is still async and will detect the
contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
in place that prevent excessive THP page fault latencies

Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5508cd8 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: introduce direct compaction priority

In the context of direct compaction, for some types of allocations we
would like the compaction to either succeed or definitely fail while
trying as hard as possible. Current async/sync_light migration mode is
insufficient, as there are heuristics such as caching scanner positions,
marking pageblocks as unsuitable or deferring compaction for a zone. At
least the final compaction attempt should be able to override these
heuristics.

To communicate how hard compaction should try, we replace migration mode
with a new enum compact_priority and change the relevant function
signatures. In compact_zone_order() where struct compact_control is
constructed, the priority is mapped to suitable control flags. This
patch itself has no functional change, as the current priority levels
are mapped back to the same migration modes as before. Expanding them
will be done next.

Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
removed, as the only caller exists under CONFIG_COMPACTION.

Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25160354 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.

We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

The patch effectively changes the current GFP_TRANSHUGE users as
follows:

* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.

As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).

[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3eb2771b 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: make THP-specific decisions more generic

Since THP allocations during page faults can be costly, extra decisions
are employed for them to avoid excessive reclaim and compaction, if the
initial compaction doesn't look promising. The detection has never been
perfect as there is no gfp flag specific to THP allocations. At this
moment it checks the whole combination of flags that makes up
GFP_TRANSHUGE, and hopes that no other users of such combination exist,
or would mind being treated the same way. Extra care is also taken to
separate allocations from khugepaged, where latency doesn't matter that
much.

It is however possible to distinguish these allocations in a simpler and
more reliable way. The key observation is that after the initial
compaction followed by the first iteration of "standard"
reclaim/compaction, both __GFP_NORETRY allocations and costly
allocations without __GFP_REPEAT are declared as failures:

/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;

/*
* Do not retry costly high order allocations unless they are
* __GFP_REPEAT
*/
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto nopage;

This means we can further distinguish allocations that are costly order
*and* additionally include the __GFP_NORETRY flag. As it happens,
GFP_TRANSHUGE allocations do already fall into this category. This will
also allow other costly allocations with similar high-order benefit vs
latency considerations to use this semantic. Furthermore, we can
distinguish THP allocations that should try a bit harder (such as from
khugepageed) by removing __GFP_NORETRY, as will be done in the next
patch.

Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8161d1e 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: restructure direct compaction handling in slowpath

The retry loop in __alloc_pages_slowpath is supposed to keep trying
reclaim and compaction (and OOM), until either the allocation succeeds,
or returns with failure. Success here is more probable when reclaim
precedes compaction, as certain watermarks have to be met for compaction
to even try, and more free pages increase the probability of compaction
success. On the other hand, starting with light async compaction (if
the watermarks allow it), can be more efficient, especially for smaller
orders, if there's enough free memory which is just fragmented.

Thus, the current code starts with compaction before reclaim, and to
make sure that the last reclaim is always followed by a final
compaction, there's another direct compaction call at the end of the
loop. This makes the code hard to follow and adds some duplicated
handling of migration_mode decisions. It's also somewhat inefficient
that even if reclaim or compaction decides not to retry, the final
compaction is still attempted. Some gfp flags combination also shortcut
these retry decisions by "goto noretry;", making it even harder to
follow.

This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are
now placed above the retry loop, and the "noretry" direct compaction is
removed.

The initial compaction is additionally restricted only to costly orders,
as we can expect smaller orders to be held back by watermarks, and only
larger orders to suffer primarily from fragmentation. This better
matches the checks in reclaim's shrink_zones().

There are two other smaller functional changes. One is that the upgrade
from async migration to light sync migration will always occur after the
initial compaction. This is how it has been until recent patch "mm,
oom: protect !costly allocations some more", which introduced upgrading
the mode based on COMPACT_COMPLETE result, but kept the final compaction
always upgraded, which made it even more special. It's better to return
to the simpler handling for now, as migration modes will be further
modified later in the series.

The second change is that once both reclaim and compaction declare it's
not worth to retry the reclaim/compact loop, there is no final
compaction attempt. As argued above, this is intentional. If that
final compaction were to succeed, it would be due to a wrong retry
decision, or simply a race with somebody else freeing memory for us.

The main outcome of this patch should be simpler code. Logically, the
initial compaction without reclaim is the exceptional case to the
reclaim/compaction scheme, but prior to the patch, it was the last loop
iteration that was exceptional. Now the code matches the logic better.
The change also enable the following patches.

Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23771235 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: don't retry initial attempt in slowpath

After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
kswapd, it first tries get_page_from_freelist() with the new
alloc_flags, as it may succeed e.g. due to using min watermark instead
of low watermark. It makes sense to to do this attempt before adjusting
zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
path if we just wake up kswapd and successfully allocate.

This patch therefore moves the initial attempt above the retry label and
reorganizes a bit the part below the retry label. We still have to
attempt get_page_from_freelist() on each retry, as some allocations
cannot do that as part of direct reclaim or compaction, and yet are not
allowed to fail (even though they do a WARN_ON_ONCE() and thus should
not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
it. As a side-effect, the attempts from direct reclaim/compaction will
also no longer obey watermarks once this is set, but there's little harm
in that.

Kswapd wakeups are also done on each retry to be safe from potential
races resulting in kswapd going to sleep while a process (that may not
be able to reclaim by itself) is still looping.

Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31a6c190 28-Jul-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: set alloc_flags only once in slowpath

In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
initialized, so move the initialization above the retry: label. Also
make the comment above the initialization more descriptive.

The only exception in the alloc_flags being constant is
ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
allocating thread. We can fix this, and make the code simpler and a bit
more effective at the same time, by moving the part that determines
ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
places in __alloc_pages_slowpath() anymore. The only two tests for the
flag can instead call gfp_pfmemalloc_allowed().

Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d30dd8be 28-Jul-2016 Andy Lutomirski <luto@kernel.org>

mm: track NR_KERNEL_STACK in KiB instead of number of stacks

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures. Keep it simple and use KiB.

Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a1c84b4 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: remove reclaim and compaction retry approximations

If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone
and node double accounting by approximating retries" with the difference
that inactive/active stats are still available. This preserves the
history of why the approximation was retried and why it had to be
reverted to handle OOM kills on 32-bit systems.

Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71c799f4 28-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: add per-zone lru list stat

When I did stress test with hackbench, I got OOM message frequently
which didn't ever happen in zone-lru.

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
..
..
__alloc_pages_nodemask+0xe52/0xe60
? new_slab+0x39c/0x3b0
new_slab+0x39c/0x3b0
___slab_alloc.constprop.87+0x6da/0x840
? __alloc_skb+0x3c/0x260
? _raw_spin_unlock_irq+0x27/0x60
? trace_hardirqs_on_caller+0xec/0x1b0
? finish_task_switch+0xa6/0x220
? poll_select_copy_remaining+0x140/0x140
__slab_alloc.isra.81.constprop.86+0x40/0x6d
? __alloc_skb+0x3c/0x260
kmem_cache_alloc+0x22c/0x260
? __alloc_skb+0x3c/0x260
__alloc_skb+0x3c/0x260
alloc_skb_with_frags+0x4e/0x1a0
sock_alloc_send_pskb+0x16a/0x1b0
? wait_for_unix_gc+0x31/0x90
? alloc_set_pte+0x2ad/0x310
unix_stream_sendmsg+0x28d/0x340
sock_sendmsg+0x2d/0x40
sock_write_iter+0x6c/0xc0
__vfs_write+0xc0/0x120
vfs_write+0x9b/0x1a0
? __might_fault+0x49/0xa0
SyS_write+0x44/0x90
do_fast_syscall_32+0xa6/0x1e0
sysenter_past_esp+0x45/0x74

Mem-Info:
active_anon:104698 inactive_anon:105791 isolated_anon:192
active_file:433 inactive_file:283 isolated_file:22
unevictable:0 dirty:0 writeback:296 unstable:0
slab_reclaimable:6389 slab_unreclaimable:78927
mapped:474 shmem:0 pagetables:101426 bounce:0
free:10518 free_pcp:334 free_cma:0
Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
25121 total pagecache pages
24160 pages in swap cache
Swap cache stats: add 86371, delete 62211, find 42865/60187
Free swap = 4015560kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9658 pages reserved
0 pages cma reserved

The order-0 allocation for normal zone failed while there are a lot of
reclaimable memory(i.e., anonymous memory with free swap). I wanted to
analyze the problem but it was hard because we removed per-zone lru stat
so I couldn't know how many of anonymous memory there are in normal/dma
zone.

When we investigate OOM problem, reclaimable memory count is crucial
stat to find a problem. Without it, it's hard to parse the OOM message
so I believe we should keep it.

With per-zone lru stat,

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
Mem-Info:
active_anon:101103 inactive_anon:102219 isolated_anon:0
active_file:503 inactive_file:544 isolated_file:0
unevictable:0 dirty:0 writeback:34 unstable:0
slab_reclaimable:6298 slab_unreclaimable:74669
mapped:863 shmem:0 pagetables:100998 bounce:0
free:23573 free_pcp:1861 free_cma:0
Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
54409 total pagecache pages
53215 pages in swap cache
Swap cache stats: add 300982, delete 247765, find 157978/226539
Free swap = 3803244kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9642 pages reserved
0 pages cma reserved

With that, we can see normal zone has a 86M reclaimable memory so we can
know something goes wrong(I will fix the problem in next patch) in
reclaim.

[mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33e077bd 28-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: show node_pages_scanned per node, not zone

The node_pages_scanned represents the number of scanned pages of node
for reclaim so it's pointless to show it as kilobytes.

As well, node_pages_scanned is per-node value, not per-zone.

This patch changes node_pages_scanned per-zone-killobytes with
per-node-count.

[minchan@kernel.org: fix node_pages_scanned]
Link: http://lkml.kernel.org/r/20160716101431.GA10305@bbox
Link: http://lkml.kernel.org/r/1468588165-12461-5-git-send-email-mgorman@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bca67592 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmstat: remove zone and node double accounting by approximating retries

The number of LRU pages, dirty pages and writeback pages must be
accounted for on both zones and nodes because of the reclaim retry
logic, compaction retry logic and highmem calculations all depending on
per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
exception is costly high-order allocations or allocations that cannot
fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then it would fall through to __alloc_pages_direct_compact.

This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
but without per-zone stats there are not many alternatives. The impact
it that zone-constrained allocations may delay before considering the
OOM killer.

As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.

In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation. While
it is possible this will make the wrong decision on occasion, it will
not infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem. This patch uses a global counter used instead of per-zone
stats as it is sufficient.

In combination, this allows the per-zone LRU and dirty state counters to
be removed.

[mgorman@techsingularity.net: fix acct_highmem_file_pages()]
Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested by: Michal Hocko <mhocko@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16709d1d 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: vmstat: replace __count_zone_vm_events with a zone id equivalent

This is partially a preparation patch for more vmstat work but it also
has the slight advantage that __count_zid_vm_events is cheaper to
calculate than __count_zone_vm_events().

Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b8c0be4 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: page_alloc: cache the last node whose dirty limit is reached

If a page is about to be dirtied then the page allocator attempts to
limit the total number of dirty pages that exists in any given zone.
The call to node_dirty_ok is expensive so this patch records if the last
pgdat examined hit the dirty limits. In some cases, this reduces the
number of calls to node_dirty_ok().

Link: http://lkml.kernel.org/r/1467970510-21195-31-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e6cbd7f2 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove fair zone allocation policy

The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone. Reclaim is now node-based so this should no longer
be an issue and the fair zone allocation policy is not free. This patch
removes it.

Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5f5f91d 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: convert zone_reclaim to node_reclaim

As reclaim is now per-node based, convert zone_reclaim to be
node_reclaim. It is possible that a node will be reclaimed multiple
times if it has multiple zones but this is unavoidable without caching
all nodes traversed so far. The documentation and interface to
userspace is the same from a configuration perspective and will will be
similar in behaviour unless the node-local allocation requests were also
limited to lower zones.

Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52e9f87a 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: wake kswapd based on the highest eligible zone

The ac_classzone_idx is used as the basis for waking kswapd and that is
based on the preferred zoneref. If the preferred zoneref's first zone
is lower than what is available on other nodes, it's possible that
kswapd is woken on a zone with only higher, but still eligible, zones.
As classzone_idx is strictly adhered to now, it causes a problem because
eligible pages are skipped.

For example, node 0 has only DMA32 and node 1 has only NORMAL. An
allocating context running on node 0 may wake kswapd on node 1 telling
it to skip all NORMAL pages.

Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1a55637 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: only wakeup kswapd once per node for the requested classzone

kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account. Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes
have already been visited. For zone-ordering, each node should be
checked only once.

Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11fb9989 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: move most file-based accounting to the node

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 50658e2e 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm: move page mapped accounting to the node

Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information. Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 281e3726 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: consider dirtyable memory in terms of nodes

Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9dd0a83 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: make shrink_node decisions more node-centric

Earlier patches focused on having direct reclaim and kswapd use data
that is node-centric for reclaiming but shrink_node() itself still uses
too much zone information. This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not. Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

[mgorman@techsingularity.net: optimization]
Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 38087d9b 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: simplify the logic deciding whether kswapd sleeps

kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order. It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat(). What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order. This patch irons out the logic to check just that and the end
result is less headache inducing.

Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 599d0c95 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: move LRU lists to node

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a52633d8 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmscan: move lru_lock to the node

Node-based reclaim requires node-based LRUs and locking. This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review. It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming
from different zones.

Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75ef7184 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, vmstat: add infrastructure for per-node vmstats

Patchset: "Move LRU page reclaim from zones to nodes v9"

This series moves LRUs from the zones to the node. While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
allocated from. This is partially combatted by the fair zone allocation
policy but that is a partial solution that introduces overhead in the
page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
example, direct reclaim scans in zonelist order and reclaims even if
the zone is over the high watermark regardless of the age of pages
in that LRU. Kswapd on the other hand starts reclaim on the highest
unbalanced zone. A difference in distribution of file/anon pages due
to when they were allocated results can result in a difference in
again. While the fair zone allocation policy mitigates some of the
problems here, the page reclaim results on a multi-zone node will
always be different to a single-zone node.
it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
avoid interfering with each other but it's sensitive to timing. This
mitigates the page allocator using pages that were allocated very recently
in the ideal case but it's sensitive to timing. When kswapd is allocating
from lower zones then it's great but during the rebalancing of the highest
zone, the page allocator and kswapd interfere with each other. It's worse
if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
User 189.19 191.80
System 2604.45 2533.56
Elapsed 2855.30 2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;

4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v8
DMA32 allocs 28794729769 0
Normal allocs 48432501431 77227309877
Movable allocs 0 0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

4.7.0-rc4 4.7.0-rc4
mmotm-20160623 approx-v9
User 645.72 525.90
System 403.85 331.75
Elapsed 6795.36 6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
Minor Faults 645838 647465
Major Faults 573 640
Swap Ins 0 0
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 46041453 44190646
Normal allocs 78053072 79887245
Movable allocs 0 0
Allocation stalls 24 67
Stall zone DMA 0 0
Stall zone DMA32 0 0
Stall zone Normal 0 2
Stall zone HighMem 0 0
Stall zone Movable 0 65
Direct pages scanned 10969 30609
Kswapd pages scanned 93375144 93492094
Kswapd pages reclaimed 93372243 93489370
Direct pages reclaimed 10969 30609
Kswapd efficiency 99% 99%
Kswapd velocity 13741.015 13781.934
Direct efficiency 100% 100%
Direct velocity 1.614 4.512
Percentage direct scans 0% 0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
User 1548.01 1552.44
System 8609.71 8515.08
Elapsed 3587.10 3594.54

There is little or no change in performance but some drop in system CPU usage.

4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Minor Faults 362662 367360
Major Faults 1204 1143
Swap Ins 22 0
Swap Outs 2855 1029
DMA allocs 0 0
DMA32 allocs 31409797 28837521
Normal allocs 46611853 49231282
Movable allocs 0 0
Direct pages scanned 0 0
Kswapd pages scanned 40845270 40869088
Kswapd pages reclaimed 40830976 40855294
Direct pages reclaimed 0 0
Kswapd efficiency 99% 99%
Kswapd velocity 11386.711 11369.769
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Percentage direct scans 0% 0%
Page writes by reclaim 2855 1029
Page writes file 0 0
Page writes anon 2855 1029
Page reclaim immediate 771 1628
Sector Reads 293312636 293536360
Sector Writes 18213568 18186480
Page rescued immediate 0 0
Slabs scanned 128257 132747
Direct inode steals 181 56
Kswapd inode steals 59 1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
Swap Ins 163 502
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 618719206 1381662383
Normal allocs 891235743 564138421
Movable allocs 0 0
Allocation stalls 2603 1
Direct pages scanned 216787 2
Kswapd pages scanned 50719775 41778378
Kswapd pages reclaimed 41541765 41777639
Direct pages reclaimed 209159 0
Kswapd efficiency 81% 99%
Kswapd velocity 16859.554 14329.059
Direct efficiency 96% 0%
Direct velocity 72.061 0.001
Percentage direct scans 0% 0%
Page writes by reclaim 6215049 0
Page writes file 6215049 0
Page writes anon 0 0
Page reclaim immediate 70673 90
Sector Reads 81940800 81680456
Sector Writes 100158984 98816036
Page rescued immediate 0 0
Slabs scanned 1366954 22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
no longer targetted at a specific zone. Compaction works on a per-zone basis
so there is no guarantee that reclaiming a few THP's worth page pages will
have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
are called is now different. This may or may not be a problem but if it
is, it'll be because shrinkers are not called enough and some balancing
is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
distributed between zones and the fair zone allocation policy used to do
something very similar for anon. The distribution is now different but not
necessarily in any way that matters but it's still worth bearing in mind.

VM statistic counters for reclaim decisions are zone-based. If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that. The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state. The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet. There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
patch with no functional change. There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a621184a 28-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, meminit: remove early_page_nid_uninitialised

The helper early_page_nid_uninitialised() has been dead since commit
974a786e63c9 ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
dead code.

Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 400bc7fd 28-Jul-2016 zhong jiang <zhongjiang@huawei.com>

mm: update the comment in __isolate_free_page

We need to assure the comment is consistent with the code.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65c45377 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, rmap: account shmem thp pages

Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps. It indicates how many times we allocate and map shmem THP.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a73f61b 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp, mlock: do not mlock PTE-mapped file huge pages

As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTE. This way we can avoid mlock leak into
non-mlocked vma on split.

We rely on PageDoubleMap() under lock_page() to check if the the page
may be PTE mapped. PG_double_map is set by page_add_file_rmap() when
the page mapped with PTEs.

Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4949148a 26-Jul-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: charge/uncharge kmemcg from generic page allocator paths

Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.

To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine. There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.

Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 45264778 26-Jul-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: memcontrol: cleanup kmem charge functions

- Handle memcg_kmem_enabled check out to the caller. This reduces the
number of function definitions making the code easier to follow. At
the same time it doesn't result in code bloat, because all of these
functions are used only in one or two places.

- Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
have to dive deep into memcg implementation to see which allocations
are charged and which are not.

- Refresh comments.

Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46f24fd8 26-Jul-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: introduce post allocation processing on page allocator

This patch is motivated from Hugh and Vlastimil's concern [1].

There are two ways to get freepage from the allocator. One is using
normal memory allocation API and the other is __isolate_free_page()
which is internally used for compaction and pageblock isolation. Later
usage is rather tricky since it doesn't do whole post allocation
processing done by normal API.

One problematic thing I already know is that poisoned page would not be
checked if it is allocated by __isolate_free_page(). Perhaps, there
would be more.

We could add more debug logic for allocated page in the future and this
separation would cause more problem. I'd like to fix this situation at
this time. Solution is simple. This patch commonize some logic for
newly allocated page and uses it on all sites. This will solve the
problem.

[1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

[iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9627bc5 26-Jul-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_owner: introduce split_page_owner and replace manual handling

split_page() calls set_page_owner() to set up page_owner to each pages.
But, it has a drawback that head page and the others have different
stacktrace because callsite of set_page_owner() is slightly differnt.
To avoid this problem, this patch copies head page's page_owner to the
others. It needs to introduce new function, split_page_owner() but it
also remove the other function, get_page_owner_gfp() so looks good to
do.

Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83358ece 26-Jul-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_owner: initialize page owner without holding the zone lock

It's not necessary to initialized page_owner with holding the zone lock.
It would cause more contention on the zone lock although it's not a big
problem since it is just debug feature. But, it is better than before
so do it. This is also preparation step to use stackdepot in page owner
feature. Stackdepot allocates new pages when there is no reserved space
and holding the zone lock in this case will cause deadlock.

Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 66c64223 26-Jul-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/compaction: split freepages without holding the zone lock

We don't need to split freepages with holding the zone lock. It will
cause more contention on zone lock so not desirable.

[rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bda807d4 26-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: migrate: support non-lru movable page migration

We have allowed migration for only LRU pages until now and it was enough
to make high-order pages. But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,. enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable. For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation. If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page. The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0. If driver cannot migrate the page at the
moment, driver can return -EAGAIN. On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure". On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page. In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM. Exactly speaking, PG_movable is
not a real flag of struct page. Rather than, VM reuses page->mapping's
lower bits to represent it.

#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly. Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page. As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable). But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated. Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page. So if a CPU encounters PG_isolated
non-lru movable page, it can skip it. Driver doesn't need to manipulate
the flag because VM will set/clear it automatically. Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field. PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a966b77 26-Jul-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: oom: add memcg to oom_control

It's a part of oom context just like allocation order and nodemask, so
let's move it to oom_control instead of passing it in the argument list.

Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90cae1fe 26-Jul-2016 Oliver O'Halloran <oohall@gmail.com>

mm/init: fix zone boundary creation

As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for each
zone. ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture. Unfortunately, this
special casing is broken when ZONE_MOVABLE is the not the last zone in
the zone list. The core of the issue is:

if (i == ZONE_MOVABLE)
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];

As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will
be set to zero. This patch fixes this bug by adding explicitly tracking
where the next zone should start rather than relying on the contents
arch_zone_highest_possible_pfn[].

Thie is low priority. To get bitten by this you need to enable a zone
that appears after ZONE_MOVABLE in the zone_type enum. As far as I can
tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
so I can't see this affecting too many people.

I only noticed this because I've been fiddling with ZONE_DEVICE on
powerpc and 4.6 broke my test kernel. This bug, in conjunction with the
changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and
powerpc being the odd architecture which initialises max_zone_pfn[] to
~0ul instead of 0 caused all of system memory to be placed into
ZONE_DEVICE at boot, followed a panic since device memory cannot be used
for kernel allocations. I've already submitted a patch to fix the
powerpc specific bits, but I figured this should be fixed too.

Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef70b6f4 14-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, meminit: ensure node is online before checking whether pages are uninitialised

early_page_uninitialised looks up an arbitrary PFN. While a machine
without node 0 will boot with "mm, page_alloc: Always return a valid
node from early_pfn_to_nid", it works because it assumes that nodes are
always in PFN order. This is not guaranteed so this patch adds
robustness by always checking if the node being checked is online.

Link: http://lkml.kernel.org/r/1468008031-3848-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4568d38 14-Jul-2016 Mel Gorman <mgorman@techsingularity.net>

mm, meminit: always return a valid node from early_pfn_to_nid

early_pfn_to_nid can return node 0 if a PFN is invalid on machines that
has no node 0. A machine with only node 1 was observed to crash with
the following message:

BUG: unable to handle kernel paging request at 000000000002a3c8
PGD 0
Modules linked in:
Hardware name: Supermicro H8DSP-8/H8DSP-8, BIOS 080011 06/30/2006
task: ffffffff81c0d500 ti: ffffffff81c00000 task.ti: ffffffff81c00000
RIP: reserve_bootmem_region+0x6a/0xef
CR2: 000000000002a3c8 CR3: 0000000001c06000 CR4: 00000000000006b0
Call Trace:
free_all_bootmem+0x4b/0x12a
mem_init+0x70/0xa3
start_kernel+0x25b/0x49b

The problem is that early_page_uninitialised uses the early_pfn_to_nid
helper which returns node 0 for invalid PFNs. No caller of
early_pfn_to_nid cares except early_page_uninitialised. This patch has
early_pfn_to_nid always return a valid node.

Link: http://lkml.kernel.org/r/1468008031-3848-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e46e7b77 03-Jun-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies

The optimistic fast path may use cpuset_current_mems_allowed instead of
of a NULL nodemask supplied by the caller for cpuset allocations. The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.

However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct. This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies. This will alter the zone used for statistics but only after
it is known that it makes sense for that context. Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone. Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.

Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
Fixes: c33d6c06f60f710 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d0bd894 03-Jun-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy

Geert Uytterhoeven reported the following problem that bisected to
commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone
in a zonelist twice") on m68k/ARAnyM

BUG: scheduling while atomic: cron/668/0x10c9a0c0
Modules linked in:
CPU: 0 PID: 668 Comm: cron Not tainted 4.6.0-atari-05133-gc33d6c06f60f710f #364
Call Trace: [<0003d7d0>] __schedule_bug+0x40/0x54
__schedule+0x312/0x388
__schedule+0x0/0x388
prepare_to_wait+0x0/0x52
schedule+0x64/0x82
schedule_timeout+0xda/0x104
set_next_entity+0x18/0x40
pick_next_task_fair+0x78/0xda
io_schedule_timeout+0x36/0x4a
bit_wait_io+0x0/0x40
bit_wait_io+0x12/0x40
__wait_on_bit+0x46/0x76
wait_on_page_bit_killable+0x64/0x6c
bit_wait_io+0x0/0x40
wake_bit_function+0x0/0x4e
__lock_page_or_retry+0xde/0x124
do_scan_async+0x114/0x17c
lookup_swap_cache+0x24/0x4e
handle_mm_fault+0x626/0x7de
find_vma+0x0/0x66
down_read+0x0/0xe
wait_on_page_bit_killable_timeout+0x77/0x7c
find_vma+0x16/0x66
do_page_fault+0xe6/0x23a
res_func+0xa3c/0x141a
buserr_c+0x190/0x6d4
res_func+0xa3c/0x141a
buserr+0x20/0x28
res_func+0xa3c/0x141a
buserr+0x20/0x28

The relationship is not obvious but it's due to a failure to rescan the
full zonelist after the fair zone allocation policy exhausts the batch
count. While this is a functional problem, it's also a performance
issue. A page allocator microbenchmark showed the following

4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
Min alloc-odr0-1 327.00 ( 0.00%) 326.00 ( 0.31%)
Min alloc-odr0-2 235.00 ( 0.00%) 235.00 ( 0.00%)
Min alloc-odr0-4 198.00 ( 0.00%) 198.00 ( 0.00%)
Min alloc-odr0-8 170.00 ( 0.00%) 170.00 ( 0.00%)
Min alloc-odr0-16 156.00 ( 0.00%) 156.00 ( 0.00%)
Min alloc-odr0-32 150.00 ( 0.00%) 150.00 ( 0.00%)
Min alloc-odr0-64 146.00 ( 0.00%) 146.00 ( 0.00%)
Min alloc-odr0-128 145.00 ( 0.00%) 145.00 ( 0.00%)
Min alloc-odr0-256 155.00 ( 0.00%) 155.00 ( 0.00%)
Min alloc-odr0-512 168.00 ( 0.00%) 165.00 ( 1.79%)
Min alloc-odr0-1024 175.00 ( 0.00%) 174.00 ( 0.57%)
Min alloc-odr0-2048 180.00 ( 0.00%) 180.00 ( 0.00%)
Min alloc-odr0-4096 187.00 ( 0.00%) 186.00 ( 0.53%)
Min alloc-odr0-8192 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-16384 191.00 ( 0.00%) 191.00 ( 0.00%)
Min alloc-odr1-1 736.00 ( 0.00%) 445.00 ( 39.54%)
Min alloc-odr1-2 343.00 ( 0.00%) 335.00 ( 2.33%)
Min alloc-odr1-4 277.00 ( 0.00%) 270.00 ( 2.53%)
Min alloc-odr1-8 238.00 ( 0.00%) 233.00 ( 2.10%)
Min alloc-odr1-16 224.00 ( 0.00%) 218.00 ( 2.68%)
Min alloc-odr1-32 210.00 ( 0.00%) 208.00 ( 0.95%)
Min alloc-odr1-64 207.00 ( 0.00%) 203.00 ( 1.93%)
Min alloc-odr1-128 276.00 ( 0.00%) 202.00 ( 26.81%)
Min alloc-odr1-256 206.00 ( 0.00%) 202.00 ( 1.94%)
Min alloc-odr1-512 207.00 ( 0.00%) 202.00 ( 2.42%)
Min alloc-odr1-1024 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr1-2048 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr1-4096 218.00 ( 0.00%) 216.00 ( 0.92%)
Min alloc-odr1-8192 341.00 ( 0.00%) 219.00 ( 35.78%)

Note that order-0 allocations are unaffected but higher orders get a
small boost from this patch and a large reduction in system CPU usage
overall as can be seen here:

4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
User 85.32 86.31
System 2221.39 2053.36
Elapsed 2368.89 2202.47

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83b9355b 03-Jun-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: prevent infinite loop in buffered_rmqueue()

In DEBUG_VM kernel, we can hit infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page
is never removed from the pcp list. Fix this by removing the page
before retrying. Also we don't need to check if page is non-NULL,
because we simply grab it from the list which was just tested for being
non-empty.

Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Link: http://lkml.kernel.org/r/20160530090154.GM2527@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f86e4271 03-Jun-2016 Yang Shi <yang.shi@linaro.org>

mm: check the return value of lookup_page_ext for all call sites

Per the discussion with Joonsoo Kim [1], we need check the return value
of lookup_page_ext() for all call sites since it might return NULL in
some cases, although it is unlikely, i.e. memory hotplug.

Tested with ltp with "page_owner=0".

[1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

[akpm@linux-foundation.org: fix build-breaking typos]
[arnd@arndb.de: fix build problems from lookup_page_ext]
Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e570f56c 20-May-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: check_new_page_bad() directly returns in __PG_HWPOISON case

Currently we check page->flags twice for "HWPoisoned" case of
check_new_page_bad(), which can cause a race with unpoisoning.

This race unnecessarily taints kernel with "BUG: Bad page state".
check_new_page_bad() is the only caller of bad_page() which is
interested in __PG_HWPOISON, so let's move the hwpoison related code in
bad_page() to it.

Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29b52de1 20-May-2016 seokhoon.yoon <iamyooon@gmail.com>

mm, kasan: fix to call kasan_free_pages() after poisoning page

When CONFIG_PAGE_POISONING and CONFIG_KASAN is enabled,
free_pages_prepare()'s codeflow is below.

1)kmemcheck_free_shadow()
2)kasan_free_pages()
- set shadow byte of page is freed
3)kernel_poison_pages()
3.1) check access to page is valid or not using kasan
---> error occur, kasan think it is invalid access
3.2) poison page
4)kernel_map_pages()

So kasan_free_pages() should be called after poisoning the page.

Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b50bcc7 20-May-2016 Stefan Bader <stefan.bader@canonical.com>

mm: use phys_addr_t for reserve_bootmem_region() arguments

Since commit 92923ca3aace ("mm: meminit: only set page reserved in the
memblock region") the reserved bit is set on reserved memblock regions.
However start and end address are passed as unsigned long. This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G but allowing to balloon in memory
(dom0_mem=1024M for example). This would define a reserved bootmem
region for the additional memory (for example on a 8GB system there was
a reverved region covering the 4GB-8GB range). But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region")
Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a138dc7 20-May-2016 Minfei Huang <mnghuan@gmail.com>

mm: use existing helper to convert "on"/"off" to boolean

It's more convenient to use existing function helper to convert string
"on/off" to boolean.

Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31e49bfd 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION

Joonsoo has reported that he is able to trigger OOM for !costly high
order requests (heavy fork() workload close the OOM) with the new oom
detection rework. This is because we rely only on should_reclaim_retry
when the compaction is disabled and it only checks watermarks for the
requested order and so we might trigger OOM when there is a lot of free
memory.

It is not very clear what are the usual workloads when the compaction is
disabled. Relying on high order allocations heavily without any
mechanism to create those orders except for unbound amount of reclaim is
certainly not a good idea.

To prevent from potential regressions let's help this configuration
some. We have to sacrifice the determinsm though because there simply
is none here possible. should_compact_retry implementation for
!CONFIG_COMPACTION, which was empty so far, will do watermark check for
order-0 on all eligible zones. This will cause retrying until either
the reclaim cannot make any further progress or all the zones are
depleted even for order-0 pages. This means that the number of retries
is basically unbounded for !costly orders but that was the case before
the rework as well so this shouldn't regress.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86a294a8 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders

"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders. While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic. The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap. If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g. hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark. The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.

I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.

The reason why compaction requires being over low rather than min
watermark is not clear to me. This check was there essentially since
56de7263fcf3 ("mm: compaction: direct compact when a high-order
allocation fails"). It is clearly an implementation detail though and
we shouldn't pull it into the generic retry logic while we should be
able to cope with such eventuality. The only place in
should_compact_retry where we retry without any upper bound is for
compaction_withdrawn() case.

Introduce compaction_zonelist_suitable function which checks the given
zonelist and returns true only if there is at least one zone which would
would unblock __compaction_suitable if more memory got reclaimed. In
this implementation it checks __compaction_suitable with NR_FREE_PAGES
plus part of the reclaimable memory as the target for the watermark
check. The reclaimable memory is reduced linearly by the allocation
order. The idea is that we do not want to reclaim all the remaining
memory for a single allocation request just unblock
__compaction_suitable which doesn't guarantee we will make a further
progress.

The new helper is then used if compaction_withdrawn() feedback was
provided so we do not retry if there is no outlook for a further
progress. !costly requests shouldn't be affected much - e.g. order-2
pages would require to have at least 64kB on the reclaimable LRUs while
order-9 would need at least 32M which should be enough to not lock up.

[vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
[akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7854ea6c 20-May-2016 Michal Hocko <mhocko@suse.com>

mm: consider compaction feedback also for costly allocation

PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
should_reclaim_retry currently where we decide to not retry after at
least order worth of pages were reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress. Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations) and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim, has
been removed quite some time ago so the assumption doesn't hold anymore.
Remove the check for the number of reclaimed pages and rely on the
compaction feedback solely. should_reclaim_retry now only makes sure
that we keep retrying reclaim for high order pages only if they are
hidden by watermaks so order-0 reclaim makes really sense.

should_compact_retry now keeps retrying even for the costly allocations.
The number of retries is reduced wrt. !costly requests because they are
less important and harder to grant and so their pressure shouldn't cause
contention for other requests or cause an over reclaim. We also do not
reset no_progress_loops for costly request to make sure we do not keep
reclaiming too agressively.

This has been tested by running a process which fragments memory:
- compact memory
- mmap large portion of the memory (1920M on 2GRAM machine with 2G
of swapspace)
- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
steps until certain amount of memory is freed (250M in my test)
and reduce the step to (step / 2) + 1 after reaching the end of
the mapping
- then run a script which populates the page cache 2G (MemTotal)
from /dev/zero to a new file
And then tries to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1
Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437

* This is the /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0

* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0

* /proc/buddyinfo after page cache got eaten

Trying to allocate 129
129

* 129 hugepages requested and all of them granted.

Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0

* /proc/buddyinfo after hugetlb allocation.

10 runs will behave as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129

So basically 100% success for all 10 attempts.
Without the patch numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37

Just for completness the base kernel without oom detection rework looks
as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8

As we can see the success rate is much more volatile and smaller without
this patch. So the patch not only makes the retry logic for costly
requests more sensible the success rate is even higher.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33c2d214 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermak check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger OOM
too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if

- the last compaction round backed off or
- we haven't completed at least MAX_COMPACT_RETRIES active
compaction rounds.

The first rule ensures that the very last attempt for compaction was not
ignored while the second guarantees that the compaction has done some
work. Multiple retries might be needed to prevent occasional pigggy
backing of other contexts to steal the compacted pages before the
current context manages to retry to allocate them.

compaction_failed() is taken as a final word from the compaction that
the retry doesn't make much sense. We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up. With this logic in place we do not
have to increase the migration mode unconditionally and rather do it
only if the compaction failed for the weaker mode. A nice side effect
is that the stronger migration mode is used only when really needed so
this has a potential of smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how
successful it was when returning compaction_made_progress so we just
have to blindly trust that another retry is worthwhile and cap the
number to something reasonable to guarantee a convergence.

If the given number of successful retries is not sufficient for a
reasonable workloads we should focus on the collected compaction
tracepoints data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

[mhocko@suse.com: fix warning]
Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ede37713 20-May-2016 Michal Hocko <mhocko@suse.com>

mm: throttle on IO only when there are too many dirty and writeback pages

wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress.

We used to do congestion_wait before commit 0e093d99763e ("writeback: do
not sleep on the congestion queue if there are no congested BDIs or if
significant congestion is not being encountered in the current zone")
but that led to undesirable stalls and sleeping for the full timeout
even when the BDI wasn't congested. Hence wait_iff_congested was used
instead.

But it seems that even wait_iff_congested doesn't work as expected. We
might have a small file LRU list with all pages dirty/writeback and yet
the bdi is not congested so this is just a cond_resched in the end and
can end up triggering pre mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce stalls
fixed by 0e093d99763e because congestion_wait is called only when we are
getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by commit 373ccbe59270 ("mm,
vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
progress") into the __alloc_pages_slowpath now that wait_iff_congested
is not used anymore. As the only remaining user of wait_iff_congested
is shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

[mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a0337e0 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, oom: rework oom detection

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent from
a premature OOM killer invocation - the LRU might be full of dirty or
writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several times
and restart if a page is freed. This is really subtle behavior and it
might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page. OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as
OOM which is per zonelist property. It is __alloc_pages_slowpath which
knows how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow. It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order. The only difference is that if there was no progress
during the reclaim we rely on zone watermark check. This is more
logical thing to do than previous 1<<order attempts which were a result
of zone_reclaimable faking the progress.

[vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5d01d0d 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, compaction: simplify __alloc_pages_direct_compact feedback interface

__alloc_pages_direct_compact communicates potential back off by two
variables:
- deferred_compaction tells that the compaction returned
COMPACT_DEFERRED
- contended_compaction is set when there is a contention on
zone->lock resp. zone->lru_lock locks

__alloc_pages_slowpath then backs of for THP allocation requests to
prevent from long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details into __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea7ab982 20-May-2016 Michal Hocko <mhocko@suse.com>

mm, compaction: change COMPACT_ constants into enum

Compaction code is doing weird dances between COMPACT_FOO -> int ->
unsigned long

But there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value so it
really makes sense to define an enum for them and make it clear that no
other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59dc76b0 20-May-2016 Rik van Riel <riel@redhat.com>

mm: vmscan: reduce size of inactive file list

The inactive file list should still be large enough to contain readahead
windows and freshly written file data, but it no longer is the only
source for detecting multiple accesses to file pages. The workingset
refault measurement code causes recently evicted file pages that get
accessed again after a shorter interval to be promoted directly to the
active list.

With that mechanism in place, we can afford to (on a larger system)
dedicate more memory to the active file list, so we can actually cache
more of the frequently used file pages in memory, and not have them
pushed out by streaming writes, once-used streaming file reads, etc.

This can help things like database workloads, where only half the page
cache can currently be used to cache the database working set. This
patch automatically increases that fraction on larger systems, using the
same ratio that has already been used for anonymous memory.

[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4741526b 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: restore the original nodemask if the fast path allocation failed

The page allocator fast path uses either the requested nodemask or
cpuset_current_mems_allowed if cpusets are enabled. If the allocation
context allows watermarks to be ignored then it can also ignore memory
policies. However, on entering the allocator slowpath the nodemask may
still be cpuset_current_mems_allowed and the policies are enforced.
This patch resets the nodemask appropriately before entering the
slowpath.

Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e611801 19-May-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: uninline the bad page part of check_new_page()

Bad pages should be rare so the code handling them doesn't need to be
inline for performance reasons. Put it to separate function which
returns void. This also assumes that the initial page_expected_state()
result will match the result of the thorough check, i.e. the page
doesn't become "good" in the meanwhile. This matches the same
expectations already in place in free_pages_check().

!DEBUG_VM bloat-o-meter:

add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
function old new delta
check_new_page_bad - 134 +134
get_page_from_freelist 3468 3194 -274

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2769dbd 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: don't duplicate code in free_pcp_prepare

The new free_pcp_prepare() function shares a lot of code with
free_pages_prepare(), which makes this a maintenance risk when some
future patch modifies only one of them. We should be able to achieve
the same effect (skipping free_pages_check() from !DEBUG_VM configs) by
adding a parameter to free_pages_prepare() and making it inline, so the
checks (and the order != 0 parts) are eliminated from the call from
free_pcp_prepare().

!DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already
inlining free_pages_prepare() and the elimination seems to work as
expected

DEBUG_VM bloat-o-meter:

add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257)
function old new delta
__free_pages_ok 297 1060 +763
free_hot_cold_page 480 752 +272
free_pages_prepare 778 - -778

Here inlining didn't occur before, and added some code, but it's ok for
a debug option.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 479f854a 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: defer debugging checks of pages allocated from the PCP

Every page allocated checks a number of page fields for validity. This
catches corruption bugs of pages that are already freed but it is
expensive. This patch weakens the debugging check by checking PCP pages
only when the PCP lists are being refilled. All compound pages are
checked. This potentially avoids debugging checks entirely if the PCP
lists are never emptied and refilled so some corruption issues may be
missed. Full checking requires DEBUG_VM.

With the two deferred debugging patches applied, the impact to a page
allocator microbenchmark is

4.6.0-rc3 4.6.0-rc3
inline-v3r6 deferalloc-v3r7
Min alloc-odr0-1 344.00 ( 0.00%) 317.00 ( 7.85%)
Min alloc-odr0-2 248.00 ( 0.00%) 231.00 ( 6.85%)
Min alloc-odr0-4 209.00 ( 0.00%) 192.00 ( 8.13%)
Min alloc-odr0-8 181.00 ( 0.00%) 166.00 ( 8.29%)
Min alloc-odr0-16 168.00 ( 0.00%) 154.00 ( 8.33%)
Min alloc-odr0-32 161.00 ( 0.00%) 148.00 ( 8.07%)
Min alloc-odr0-64 158.00 ( 0.00%) 145.00 ( 8.23%)
Min alloc-odr0-128 156.00 ( 0.00%) 143.00 ( 8.33%)
Min alloc-odr0-256 168.00 ( 0.00%) 154.00 ( 8.33%)
Min alloc-odr0-512 178.00 ( 0.00%) 167.00 ( 6.18%)
Min alloc-odr0-1024 186.00 ( 0.00%) 174.00 ( 6.45%)
Min alloc-odr0-2048 192.00 ( 0.00%) 180.00 ( 6.25%)
Min alloc-odr0-4096 198.00 ( 0.00%) 184.00 ( 7.07%)
Min alloc-odr0-8192 200.00 ( 0.00%) 188.00 ( 6.00%)
Min alloc-odr0-16384 201.00 ( 0.00%) 188.00 ( 6.47%)
Min free-odr0-1 189.00 ( 0.00%) 180.00 ( 4.76%)
Min free-odr0-2 132.00 ( 0.00%) 126.00 ( 4.55%)
Min free-odr0-4 104.00 ( 0.00%) 99.00 ( 4.81%)
Min free-odr0-8 90.00 ( 0.00%) 85.00 ( 5.56%)
Min free-odr0-16 84.00 ( 0.00%) 80.00 ( 4.76%)
Min free-odr0-32 80.00 ( 0.00%) 76.00 ( 5.00%)
Min free-odr0-64 78.00 ( 0.00%) 74.00 ( 5.13%)
Min free-odr0-128 77.00 ( 0.00%) 73.00 ( 5.19%)
Min free-odr0-256 94.00 ( 0.00%) 91.00 ( 3.19%)
Min free-odr0-512 108.00 ( 0.00%) 112.00 ( -3.70%)
Min free-odr0-1024 115.00 ( 0.00%) 118.00 ( -2.61%)
Min free-odr0-2048 120.00 ( 0.00%) 125.00 ( -4.17%)
Min free-odr0-4096 123.00 ( 0.00%) 129.00 ( -4.88%)
Min free-odr0-8192 126.00 ( 0.00%) 130.00 ( -3.17%)
Min free-odr0-16384 126.00 ( 0.00%) 131.00 ( -3.97%)

Note that the free paths for large numbers of pages is impacted as the
debugging cost gets shifted into that path when the page data is no
longer necessarily cache-hot.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4db7548c 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: defer debugging checks of freed pages until a PCP drain

Every page free checks a number of page fields for validity. This
catches premature frees and corruptions but it is also expensive. This
patch weakens the debugging check by checking PCP pages at the time they
are drained from the PCP list. This will trigger the bug but the site
that freed the corrupt page will be lost. To get the full context, a
kernel rebuild with DEBUG_VM is necessary.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 002f2906 19-May-2016 Vlastimil Babka <vbabka@suse.cz>

cpuset: use static key better and convert to new API

An important function for cpusets is cpuset_node_allowed(), which
optimizes on the fact if there's a single root CPU set, it must be
trivially allowed. But the check "nr_cpusets() <= 1" doesn't use the
cpusets_enabled_key static key the right way where static keys eliminate
branching overhead with jump labels.

This patch converts it so that static key is used properly. It's also
switched to the new static key API and the checking functions are
converted to return bool instead of int. We also provide a new variant
__cpuset_zone_allowed() which expects that the static key check was
already done and they key was enabled. This is needed for
get_page_from_freelist() where we want to also avoid the relatively
slower check when ALLOC_CPUSET is not set in alloc_flags.

The impact on the page allocator microbenchmark is less than expected
but the cleanup in itself is worthwhile.

4.6.0-rc2 4.6.0-rc2
multcheck-v1r20 cpuset-v1r20
Min alloc-odr0-1 348.00 ( 0.00%) 348.00 ( 0.00%)
Min alloc-odr0-2 254.00 ( 0.00%) 254.00 ( 0.00%)
Min alloc-odr0-4 213.00 ( 0.00%) 213.00 ( 0.00%)
Min alloc-odr0-8 186.00 ( 0.00%) 183.00 ( 1.61%)
Min alloc-odr0-16 173.00 ( 0.00%) 171.00 ( 1.16%)
Min alloc-odr0-32 166.00 ( 0.00%) 163.00 ( 1.81%)
Min alloc-odr0-64 162.00 ( 0.00%) 159.00 ( 1.85%)
Min alloc-odr0-128 160.00 ( 0.00%) 157.00 ( 1.88%)
Min alloc-odr0-256 169.00 ( 0.00%) 166.00 ( 1.78%)
Min alloc-odr0-512 180.00 ( 0.00%) 180.00 ( 0.00%)
Min alloc-odr0-1024 188.00 ( 0.00%) 187.00 ( 0.53%)
Min alloc-odr0-2048 194.00 ( 0.00%) 193.00 ( 0.52%)
Min alloc-odr0-4096 199.00 ( 0.00%) 198.00 ( 0.50%)
Min alloc-odr0-8192 202.00 ( 0.00%) 201.00 ( 0.50%)
Min alloc-odr0-16384 203.00 ( 0.00%) 202.00 ( 0.49%)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b423ca2 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: inline pageblock lookup in page free fast paths

The function call overhead of get_pfnblock_flags_mask() is measurable in
the page free paths. This patch uses an inlined version that is faster.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5b31ac2 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary variable from free_pcppages_bulk

The original count is never reused so it can be removed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da838d4f 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: pull out side effects from free_pages_check

Check without side-effects should be easier to maintain. It also
removes the duplicated cpupid and flags reset done in !DEBUG_VM variant
of both free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it
enables the next patch.

It shouldn't result in new branches, thanks to inlining of the check.

!DEBUG_VM bloat-o-meter:

add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27)
function old new delta
__free_pages_ok 748 739 -9
free_pcppages_bulk 1403 1385 -18

DEBUG_VM:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28)
function old new delta
free_pages_prepare 806 778 -28

This is also slightly faster because cpupid information is not set on
tail pages so we can avoid resets there.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb552ac6 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: un-inline the bad part of free_pages_check

From: Vlastimil Babka <vbabka@suse.cz>

!DEBUG_VM size and bloat-o-meter:

add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-370 (-246)
function old new delta
free_pages_check_bad - 124 +124
free_pcppages_bulk 1288 1171 -117
__free_pages_ok 948 695 -253

DEBUG_VM:

add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-214 (-90)
function old new delta
free_pages_check_bad - 124 +124
free_pages_prepare 1112 898 -214

[akpm@linux-foundation.org: fix whitespace]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7bfec6f4 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: check multiple page fields with a single branch

Every page allocated or freed is checked for sanity to avoid corruptions
that are difficult to detect later. A bad page could be due to a number
of fields. Instead of using multiple branches, this patch combines
multiple fields into a single branch. A detailed check is only
necessary if that check fails.

4.6.0-rc2 4.6.0-rc2
initonce-v1r20 multcheck-v1r20
Min alloc-odr0-1 359.00 ( 0.00%) 348.00 ( 3.06%)
Min alloc-odr0-2 260.00 ( 0.00%) 254.00 ( 2.31%)
Min alloc-odr0-4 214.00 ( 0.00%) 213.00 ( 0.47%)
Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
Min alloc-odr0-32 165.00 ( 0.00%) 166.00 ( -0.61%)
Min alloc-odr0-64 162.00 ( 0.00%) 162.00 ( 0.00%)
Min alloc-odr0-128 161.00 ( 0.00%) 160.00 ( 0.62%)
Min alloc-odr0-256 170.00 ( 0.00%) 169.00 ( 0.59%)
Min alloc-odr0-512 181.00 ( 0.00%) 180.00 ( 0.55%)
Min alloc-odr0-1024 190.00 ( 0.00%) 188.00 ( 1.05%)
Min alloc-odr0-2048 196.00 ( 0.00%) 194.00 ( 1.02%)
Min alloc-odr0-4096 202.00 ( 0.00%) 199.00 ( 1.49%)
Min alloc-odr0-8192 205.00 ( 0.00%) 202.00 ( 1.46%)
Min alloc-odr0-16384 205.00 ( 0.00%) 203.00 ( 0.98%)

Again, the benefit is marginal but avoiding excessive branches is
important. Ideally the paths would not have to check these conditions
at all but regrettably abandoning the tests would make use-after-free
bugs much harder to detect.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93ea9964 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove field from alloc_context

The classzone_idx can be inferred from preferred_zoneref so remove the
unnecessary field and save stack space.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c33d6c06 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: avoid looking up the first zone in a zonelist twice

The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the necessary information.

4.6.0-rc2 4.6.0-rc2
fastmark-v1r20 initonce-v1r20
Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)

The benefit is negligible and the results are within the noise but each
cycle counts.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48ee5f36 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: shortcut watermark checks for order-0 pages

Watermarks have to be checked on every allocation including the number
of pages being allocated and whether reserves can be accessed. The
reserves only matter if memory is limited and the free_pages adjustment
only applies to high-order pages. This patch adds a shortcut for
order-0 pages that avoids numerous calculations if there is plenty of
free memory yielding the following performance difference in a page
allocator microbenchmark;

4.6.0-rc2 4.6.0-rc2
optfair-v1r20 fastmark-v1r20
Min alloc-odr0-1 380.00 ( 0.00%) 364.00 ( 4.21%)
Min alloc-odr0-2 273.00 ( 0.00%) 262.00 ( 4.03%)
Min alloc-odr0-4 227.00 ( 0.00%) 214.00 ( 5.73%)
Min alloc-odr0-8 196.00 ( 0.00%) 186.00 ( 5.10%)
Min alloc-odr0-16 183.00 ( 0.00%) 173.00 ( 5.46%)
Min alloc-odr0-32 173.00 ( 0.00%) 165.00 ( 4.62%)
Min alloc-odr0-64 169.00 ( 0.00%) 161.00 ( 4.73%)
Min alloc-odr0-128 169.00 ( 0.00%) 159.00 ( 5.92%)
Min alloc-odr0-256 180.00 ( 0.00%) 168.00 ( 6.67%)
Min alloc-odr0-512 190.00 ( 0.00%) 180.00 ( 5.26%)
Min alloc-odr0-1024 198.00 ( 0.00%) 190.00 ( 4.04%)
Min alloc-odr0-2048 204.00 ( 0.00%) 196.00 ( 3.92%)
Min alloc-odr0-4096 209.00 ( 0.00%) 202.00 ( 3.35%)
Min alloc-odr0-8192 213.00 ( 0.00%) 206.00 ( 3.29%)
Min alloc-odr0-16384 214.00 ( 0.00%) 206.00 ( 3.74%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30534755 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: reduce cost of fair zone allocation policy retry

The fair zone allocation policy is not without cost but it can be
reduced slightly. This patch removes an unnecessary local variable,
checks the likely conditions of the fair zone policy first, uses a bool
instead of a flags check and falls through when a remote node is
encountered instead of doing a full restart. The benefit is marginal
but it's there

4.6.0-rc2 4.6.0-rc2
decstat-v1r20 optfair-v1r20
Min alloc-odr0-1 377.00 ( 0.00%) 380.00 ( -0.80%)
Min alloc-odr0-2 273.00 ( 0.00%) 273.00 ( 0.00%)
Min alloc-odr0-4 226.00 ( 0.00%) 227.00 ( -0.44%)
Min alloc-odr0-8 196.00 ( 0.00%) 196.00 ( 0.00%)
Min alloc-odr0-16 183.00 ( 0.00%) 183.00 ( 0.00%)
Min alloc-odr0-32 175.00 ( 0.00%) 173.00 ( 1.14%)
Min alloc-odr0-64 172.00 ( 0.00%) 169.00 ( 1.74%)
Min alloc-odr0-128 170.00 ( 0.00%) 169.00 ( 0.59%)
Min alloc-odr0-256 183.00 ( 0.00%) 180.00 ( 1.64%)
Min alloc-odr0-512 191.00 ( 0.00%) 190.00 ( 0.52%)
Min alloc-odr0-1024 199.00 ( 0.00%) 198.00 ( 0.50%)
Min alloc-odr0-2048 204.00 ( 0.00%) 204.00 ( 0.00%)
Min alloc-odr0-4096 210.00 ( 0.00%) 209.00 ( 0.48%)
Min alloc-odr0-8192 213.00 ( 0.00%) 213.00 ( 0.00%)
Min alloc-odr0-16384 214.00 ( 0.00%) 214.00 ( 0.00%)

The benefit is marginal at best but one of the most important benefits,
avoiding a second search when falling back to another node is not
triggered by this particular test so the benefit for some corner cases
is understated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4fcb0971 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: shorten the page allocator fast path

The page allocator fast path checks page multiple times unnecessarily.
This patch avoids all the slowpath checks if the first allocation
attempt succeeds.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3777999d 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: check once if a zone has isolated pageblocks

When bulk freeing pages from the per-cpu lists the zone is checked for
isolated pageblocks on every release. This patch checks it once per
drain.

[mgorman@techsingularity.net: fix locking radce, per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83d4ca81 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath

__GFP_HARDWALL only has meaning in the context of cpusets but the fast
path always applies the flag on the first attempt. Move the
manipulations into the cpuset paths where they will be masked by a
static branch in the common case.

With the other micro-optimisations in this series combined, the impact
on a page allocator microbenchmark is

4.6.0-rc2 4.6.0-rc2
decstat-v1r20 micro-v1r20
Min alloc-odr0-1 381.00 ( 0.00%) 377.00 ( 1.05%)
Min alloc-odr0-2 275.00 ( 0.00%) 273.00 ( 0.73%)
Min alloc-odr0-4 229.00 ( 0.00%) 226.00 ( 1.31%)
Min alloc-odr0-8 199.00 ( 0.00%) 196.00 ( 1.51%)
Min alloc-odr0-16 186.00 ( 0.00%) 183.00 ( 1.61%)
Min alloc-odr0-32 179.00 ( 0.00%) 175.00 ( 2.23%)
Min alloc-odr0-64 174.00 ( 0.00%) 172.00 ( 1.15%)
Min alloc-odr0-128 172.00 ( 0.00%) 170.00 ( 1.16%)
Min alloc-odr0-256 181.00 ( 0.00%) 183.00 ( -1.10%)
Min alloc-odr0-512 193.00 ( 0.00%) 191.00 ( 1.04%)
Min alloc-odr0-1024 201.00 ( 0.00%) 199.00 ( 1.00%)
Min alloc-odr0-2048 206.00 ( 0.00%) 204.00 ( 0.97%)
Min alloc-odr0-4096 212.00 ( 0.00%) 210.00 ( 0.94%)
Min alloc-odr0-8192 215.00 ( 0.00%) 213.00 ( 0.93%)
Min alloc-odr0-16384 216.00 ( 0.00%) 214.00 ( 0.93%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5bb1b169 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()

page is guaranteed to be set before it is read with or without the
initialisation.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be06af00 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary initialisation in get_page_from_freelist

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4dfa6cd8 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary local variable in get_page_from_freelist

zonelist here is a copy of a struct field that is used once. Ditch it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa379b95 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: convert nr_fair_skipped to bool

The number of zones skipped to a zone expiring its fair zone allocation
quota is irrelevant. Convert to bool.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c603844b 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: convert alloc_flags to unsigned

alloc_flags is a bitmask of flags but it is signed which does not
necessarily generate the best code depending on the compiler. Even
without an impact, it makes more sense that this be unsigned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f75fb889 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: avoid unnecessary zone lookups during pageblock operations

Pageblocks have an associated bitmap to store migrate types and whether
the pageblock should be skipped during compaction. The bitmap may be
associated with a memory section or a zone but the zone is looked up
unconditionally. The compiler should optimise this away automatically
so this is a cosmetic patch only in many cases.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 754078eb 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: use __dec_zone_state for order-0 page allocation

__dec_zone_state is cheaper to use for removing an order-0 page as it
has fewer conditions to check.

The performance difference on a page allocator microbenchmark is;

4.6.0-rc2 4.6.0-rc2
optiter-v1r20 decstat-v1r20
Min alloc-odr0-1 382.00 ( 0.00%) 381.00 ( 0.26%)
Min alloc-odr0-2 282.00 ( 0.00%) 275.00 ( 2.48%)
Min alloc-odr0-4 233.00 ( 0.00%) 229.00 ( 1.72%)
Min alloc-odr0-8 203.00 ( 0.00%) 199.00 ( 1.97%)
Min alloc-odr0-16 188.00 ( 0.00%) 186.00 ( 1.06%)
Min alloc-odr0-32 182.00 ( 0.00%) 179.00 ( 1.65%)
Min alloc-odr0-64 177.00 ( 0.00%) 174.00 ( 1.69%)
Min alloc-odr0-128 175.00 ( 0.00%) 172.00 ( 1.71%)
Min alloc-odr0-256 184.00 ( 0.00%) 181.00 ( 1.63%)
Min alloc-odr0-512 197.00 ( 0.00%) 193.00 ( 2.03%)
Min alloc-odr0-1024 203.00 ( 0.00%) 201.00 ( 0.99%)
Min alloc-odr0-2048 209.00 ( 0.00%) 206.00 ( 1.44%)
Min alloc-odr0-4096 214.00 ( 0.00%) 212.00 ( 0.93%)
Min alloc-odr0-8192 218.00 ( 0.00%) 215.00 ( 1.38%)
Min alloc-odr0-16384 219.00 ( 0.00%) 216.00 ( 1.37%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 682a3385 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: inline the fast path of the zonelist iterator

The page allocator iterates through a zonelist for zones that match the
addressing limitations and nodemask of the caller but many allocations
will not be restricted. Despite this, there is always functional call
overhead which builds up.

This patch inlines the optimistic basic case and only calls the iterator
function for the complex case. A hindrance was the fact that
cpuset_current_mems_allowed is used in the fastpath as the allowed
nodemask even though all nodes are allowed on most systems. The patch
handles this by only considering cpuset_current_mems_allowed if a cpuset
exists. As well as being faster in the fast-path, this removes some
junk in the slowpath.

The performance difference on a page allocator microbenchmark is;

4.6.0-rc2 4.6.0-rc2
statinline-v1r20 optiter-v1r20
Min alloc-odr0-1 412.00 ( 0.00%) 382.00 ( 7.28%)
Min alloc-odr0-2 301.00 ( 0.00%) 282.00 ( 6.31%)
Min alloc-odr0-4 247.00 ( 0.00%) 233.00 ( 5.67%)
Min alloc-odr0-8 215.00 ( 0.00%) 203.00 ( 5.58%)
Min alloc-odr0-16 199.00 ( 0.00%) 188.00 ( 5.53%)
Min alloc-odr0-32 191.00 ( 0.00%) 182.00 ( 4.71%)
Min alloc-odr0-64 187.00 ( 0.00%) 177.00 ( 5.35%)
Min alloc-odr0-128 185.00 ( 0.00%) 175.00 ( 5.41%)
Min alloc-odr0-256 193.00 ( 0.00%) 184.00 ( 4.66%)
Min alloc-odr0-512 207.00 ( 0.00%) 197.00 ( 4.83%)
Min alloc-odr0-1024 213.00 ( 0.00%) 203.00 ( 4.69%)
Min alloc-odr0-2048 220.00 ( 0.00%) 209.00 ( 5.00%)
Min alloc-odr0-4096 226.00 ( 0.00%) 214.00 ( 5.31%)
Min alloc-odr0-8192 229.00 ( 0.00%) 218.00 ( 4.80%)
Min alloc-odr0-16384 229.00 ( 0.00%) 219.00 ( 4.37%)

perf indicated that next_zones_zonelist disappeared in the profile and
__next_zones_zonelist did not appear. This is expected as the
micro-benchmark would hit the inlined fast-path every time.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 060e7417 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: inline zone_statistics

zone_statistics has one call-site but it's a public function. Make it
static and inline.

The performance difference on a page allocator microbenchmark is;

4.6.0-rc2 4.6.0-rc2
statbranch-v1r20 statinline-v1r20
Min alloc-odr0-1 419.00 ( 0.00%) 412.00 ( 1.67%)
Min alloc-odr0-2 305.00 ( 0.00%) 301.00 ( 1.31%)
Min alloc-odr0-4 250.00 ( 0.00%) 247.00 ( 1.20%)
Min alloc-odr0-8 219.00 ( 0.00%) 215.00 ( 1.83%)
Min alloc-odr0-16 203.00 ( 0.00%) 199.00 ( 1.97%)
Min alloc-odr0-32 195.00 ( 0.00%) 191.00 ( 2.05%)
Min alloc-odr0-64 191.00 ( 0.00%) 187.00 ( 2.09%)
Min alloc-odr0-128 189.00 ( 0.00%) 185.00 ( 2.12%)
Min alloc-odr0-256 198.00 ( 0.00%) 193.00 ( 2.53%)
Min alloc-odr0-512 210.00 ( 0.00%) 207.00 ( 1.43%)
Min alloc-odr0-1024 216.00 ( 0.00%) 213.00 ( 1.39%)
Min alloc-odr0-2048 221.00 ( 0.00%) 220.00 ( 0.45%)
Min alloc-odr0-4096 227.00 ( 0.00%) 226.00 ( 0.44%)
Min alloc-odr0-8192 232.00 ( 0.00%) 229.00 ( 1.29%)
Min alloc-odr0-16384 232.00 ( 0.00%) 229.00 ( 1.29%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17514574 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: use new PageAnonHead helper in the free page fast path

The PageAnon check always checks for compound_head but this is a
relatively expensive check if the caller already knows the page is a
head page. This patch creates a helper and uses it in the page free
path which only operates on head pages.

With this patch and "Only check PageCompound for high-order pages", the
performance difference on a page allocator microbenchmark is;

4.6.0-rc2 4.6.0-rc2
vanilla nocompound-v1r20
Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%)
Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%)
Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%)
Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%)
Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%)
Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%)
Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%)
Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%)
Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%)
Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%)
Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%)
Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%)
Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%)
Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%)
Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%)
Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%)
Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%)
Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%)
Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%)
Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%)
Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%)
Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%)
Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%)
Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%)
Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%)
Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%)

There is a sizable boost to the free allocator performance. While there
is an apparent boost on the allocation side, it's likely a co-incidence
or due to the patches slightly reducing cache footprint.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d61f8590 19-May-2016 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: only check PageCompound for high-order pages

Another year, another round of page allocator optimisations focusing
this time on the alloc and free fast paths. This should be of help to
workloads that are allocator-intensive from kernel space where the cost
of zeroing is not nceessraily incurred.

The series is motivated by the observation that page alloc
microbenchmarks on multiple machines regressed between 3.12.44 and 4.4.
Second, there is discussions before LSF/MM considering the possibility
of adding another page allocator which is potentially hazardous but a
patch series improving performance is better than whining.

After the series is applied, there are still hazards. In the free
paths, the debugging checking and page zone/pageblock lookups dominate
but there was not an obvious solution to that. In the alloc path, the
major contributers are dealing with zonelists, new page preperation, the
fair zone allocation and numerous statistic updates. The fair zone
allocator is removed by the per-node LRU series if that gets merged so
it's nor a major concern at the moment.

On normal userspace benchmarks, there is little impact as the zeroing
cost is significant but it's visible

aim9
4.6.0-rc3 4.6.0-rc3
vanilla deferalloc-v3
Min page_test 828693.33 ( 0.00%) 887060.00 ( 7.04%)
Min brk_test 4847266.67 ( 0.00%) 4966266.67 ( 2.45%)
Min exec_test 1271.00 ( 0.00%) 1275.67 ( 0.37%)
Min fork_test 12371.75 ( 0.00%) 12380.00 ( 0.07%)

The overall impact on a page allocator microbenchmark for a range of orders
and number of pages allocated in a batch is

4.6.0-rc3 4.6.0-rc3
vanilla deferalloc-v3r7
Min alloc-odr0-1 428.00 ( 0.00%) 316.00 ( 26.17%)
Min alloc-odr0-2 314.00 ( 0.00%) 231.00 ( 26.43%)
Min alloc-odr0-4 256.00 ( 0.00%) 192.00 ( 25.00%)
Min alloc-odr0-8 222.00 ( 0.00%) 166.00 ( 25.23%)
Min alloc-odr0-16 207.00 ( 0.00%) 154.00 ( 25.60%)
Min alloc-odr0-32 197.00 ( 0.00%) 148.00 ( 24.87%)
Min alloc-odr0-64 193.00 ( 0.00%) 144.00 ( 25.39%)
Min alloc-odr0-128 191.00 ( 0.00%) 143.00 ( 25.13%)
Min alloc-odr0-256 203.00 ( 0.00%) 153.00 ( 24.63%)
Min alloc-odr0-512 212.00 ( 0.00%) 165.00 ( 22.17%)
Min alloc-odr0-1024 221.00 ( 0.00%) 172.00 ( 22.17%)
Min alloc-odr0-2048 225.00 ( 0.00%) 179.00 ( 20.44%)
Min alloc-odr0-4096 232.00 ( 0.00%) 185.00 ( 20.26%)
Min alloc-odr0-8192 235.00 ( 0.00%) 187.00 ( 20.43%)
Min alloc-odr0-16384 236.00 ( 0.00%) 188.00 ( 20.34%)
Min alloc-odr1-1 519.00 ( 0.00%) 450.00 ( 13.29%)
Min alloc-odr1-2 391.00 ( 0.00%) 336.00 ( 14.07%)
Min alloc-odr1-4 313.00 ( 0.00%) 268.00 ( 14.38%)
Min alloc-odr1-8 277.00 ( 0.00%) 235.00 ( 15.16%)
Min alloc-odr1-16 256.00 ( 0.00%) 218.00 ( 14.84%)
Min alloc-odr1-32 252.00 ( 0.00%) 212.00 ( 15.87%)
Min alloc-odr1-64 244.00 ( 0.00%) 206.00 ( 15.57%)
Min alloc-odr1-128 244.00 ( 0.00%) 207.00 ( 15.16%)
Min alloc-odr1-256 243.00 ( 0.00%) 207.00 ( 14.81%)
Min alloc-odr1-512 245.00 ( 0.00%) 209.00 ( 14.69%)
Min alloc-odr1-1024 248.00 ( 0.00%) 214.00 ( 13.71%)
Min alloc-odr1-2048 253.00 ( 0.00%) 220.00 ( 13.04%)
Min alloc-odr1-4096 258.00 ( 0.00%) 224.00 ( 13.18%)
Min alloc-odr1-8192 261.00 ( 0.00%) 229.00 ( 12.26%)
Min alloc-odr2-1 560.00 ( 0.00%) 753.00 (-34.46%)
Min alloc-odr2-2 424.00 ( 0.00%) 351.00 ( 17.22%)
Min alloc-odr2-4 339.00 ( 0.00%) 393.00 (-15.93%)
Min alloc-odr2-8 298.00 ( 0.00%) 246.00 ( 17.45%)
Min alloc-odr2-16 276.00 ( 0.00%) 227.00 ( 17.75%)
Min alloc-odr2-32 271.00 ( 0.00%) 221.00 ( 18.45%)
Min alloc-odr2-64 264.00 ( 0.00%) 217.00 ( 17.80%)
Min alloc-odr2-128 264.00 ( 0.00%) 217.00 ( 17.80%)
Min alloc-odr2-256 264.00 ( 0.00%) 218.00 ( 17.42%)
Min alloc-odr2-512 269.00 ( 0.00%) 223.00 ( 17.10%)
Min alloc-odr2-1024 279.00 ( 0.00%) 230.00 ( 17.56%)
Min alloc-odr2-2048 283.00 ( 0.00%) 235.00 ( 16.96%)
Min alloc-odr2-4096 285.00 ( 0.00%) 239.00 ( 16.14%)
Min alloc-odr3-1 629.00 ( 0.00%) 505.00 ( 19.71%)
Min alloc-odr3-2 472.00 ( 0.00%) 374.00 ( 20.76%)
Min alloc-odr3-4 383.00 ( 0.00%) 301.00 ( 21.41%)
Min alloc-odr3-8 341.00 ( 0.00%) 266.00 ( 21.99%)
Min alloc-odr3-16 316.00 ( 0.00%) 248.00 ( 21.52%)
Min alloc-odr3-32 308.00 ( 0.00%) 241.00 ( 21.75%)
Min alloc-odr3-64 305.00 ( 0.00%) 241.00 ( 20.98%)
Min alloc-odr3-128 308.00 ( 0.00%) 244.00 ( 20.78%)
Min alloc-odr3-256 317.00 ( 0.00%) 249.00 ( 21.45%)
Min alloc-odr3-512 327.00 ( 0.00%) 256.00 ( 21.71%)
Min alloc-odr3-1024 331.00 ( 0.00%) 261.00 ( 21.15%)
Min alloc-odr3-2048 333.00 ( 0.00%) 266.00 ( 20.12%)
Min alloc-odr4-1 767.00 ( 0.00%) 572.00 ( 25.42%)
Min alloc-odr4-2 578.00 ( 0.00%) 429.00 ( 25.78%)
Min alloc-odr4-4 474.00 ( 0.00%) 346.00 ( 27.00%)
Min alloc-odr4-8 422.00 ( 0.00%) 310.00 ( 26.54%)
Min alloc-odr4-16 399.00 ( 0.00%) 295.00 ( 26.07%)
Min alloc-odr4-32 392.00 ( 0.00%) 293.00 ( 25.26%)
Min alloc-odr4-64 394.00 ( 0.00%) 293.00 ( 25.63%)
Min alloc-odr4-128 405.00 ( 0.00%) 305.00 ( 24.69%)
Min alloc-odr4-256 417.00 ( 0.00%) 319.00 ( 23.50%)
Min alloc-odr4-512 425.00 ( 0.00%) 326.00 ( 23.29%)
Min alloc-odr4-1024 426.00 ( 0.00%) 329.00 ( 22.77%)
Min free-odr0-1 216.00 ( 0.00%) 178.00 ( 17.59%)
Min free-odr0-2 152.00 ( 0.00%) 125.00 ( 17.76%)
Min free-odr0-4 120.00 ( 0.00%) 99.00 ( 17.50%)
Min free-odr0-8 106.00 ( 0.00%) 85.00 ( 19.81%)
Min free-odr0-16 97.00 ( 0.00%) 80.00 ( 17.53%)
Min free-odr0-32 92.00 ( 0.00%) 76.00 ( 17.39%)
Min free-odr0-64 89.00 ( 0.00%) 74.00 ( 16.85%)
Min free-odr0-128 89.00 ( 0.00%) 73.00 ( 17.98%)
Min free-odr0-256 107.00 ( 0.00%) 90.00 ( 15.89%)
Min free-odr0-512 117.00 ( 0.00%) 108.00 ( 7.69%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 132.00 ( 0.00%) 125.00 ( 5.30%)
Min free-odr0-4096 135.00 ( 0.00%) 130.00 ( 3.70%)
Min free-odr0-8192 137.00 ( 0.00%) 130.00 ( 5.11%)
Min free-odr0-16384 137.00 ( 0.00%) 131.00 ( 4.38%)
Min free-odr1-1 318.00 ( 0.00%) 289.00 ( 9.12%)
Min free-odr1-2 228.00 ( 0.00%) 207.00 ( 9.21%)
Min free-odr1-4 182.00 ( 0.00%) 165.00 ( 9.34%)
Min free-odr1-8 163.00 ( 0.00%) 146.00 ( 10.43%)
Min free-odr1-16 151.00 ( 0.00%) 135.00 ( 10.60%)
Min free-odr1-32 146.00 ( 0.00%) 129.00 ( 11.64%)
Min free-odr1-64 145.00 ( 0.00%) 130.00 ( 10.34%)
Min free-odr1-128 148.00 ( 0.00%) 134.00 ( 9.46%)
Min free-odr1-256 148.00 ( 0.00%) 137.00 ( 7.43%)
Min free-odr1-512 151.00 ( 0.00%) 140.00 ( 7.28%)
Min free-odr1-1024 154.00 ( 0.00%) 143.00 ( 7.14%)
Min free-odr1-2048 156.00 ( 0.00%) 144.00 ( 7.69%)
Min free-odr1-4096 156.00 ( 0.00%) 142.00 ( 8.97%)
Min free-odr1-8192 156.00 ( 0.00%) 140.00 ( 10.26%)
Min free-odr2-1 361.00 ( 0.00%) 457.00 (-26.59%)
Min free-odr2-2 258.00 ( 0.00%) 224.00 ( 13.18%)
Min free-odr2-4 208.00 ( 0.00%) 223.00 ( -7.21%)
Min free-odr2-8 185.00 ( 0.00%) 160.00 ( 13.51%)
Min free-odr2-16 173.00 ( 0.00%) 149.00 ( 13.87%)
Min free-odr2-32 166.00 ( 0.00%) 145.00 ( 12.65%)
Min free-odr2-64 166.00 ( 0.00%) 146.00 ( 12.05%)
Min free-odr2-128 169.00 ( 0.00%) 148.00 ( 12.43%)
Min free-odr2-256 170.00 ( 0.00%) 152.00 ( 10.59%)
Min free-odr2-512 177.00 ( 0.00%) 156.00 ( 11.86%)
Min free-odr2-1024 182.00 ( 0.00%) 162.00 ( 10.99%)
Min free-odr2-2048 181.00 ( 0.00%) 160.00 ( 11.60%)
Min free-odr2-4096 180.00 ( 0.00%) 159.00 ( 11.67%)
Min free-odr3-1 431.00 ( 0.00%) 367.00 ( 14.85%)
Min free-odr3-2 306.00 ( 0.00%) 259.00 ( 15.36%)
Min free-odr3-4 249.00 ( 0.00%) 208.00 ( 16.47%)
Min free-odr3-8 224.00 ( 0.00%) 186.00 ( 16.96%)
Min free-odr3-16 208.00 ( 0.00%) 176.00 ( 15.38%)
Min free-odr3-32 206.00 ( 0.00%) 174.00 ( 15.53%)
Min free-odr3-64 210.00 ( 0.00%) 178.00 ( 15.24%)
Min free-odr3-128 215.00 ( 0.00%) 182.00 ( 15.35%)
Min free-odr3-256 224.00 ( 0.00%) 189.00 ( 15.62%)
Min free-odr3-512 232.00 ( 0.00%) 195.00 ( 15.95%)
Min free-odr3-1024 230.00 ( 0.00%) 195.00 ( 15.22%)
Min free-odr3-2048 229.00 ( 0.00%) 193.00 ( 15.72%)
Min free-odr4-1 561.00 ( 0.00%) 439.00 ( 21.75%)
Min free-odr4-2 418.00 ( 0.00%) 318.00 ( 23.92%)
Min free-odr4-4 339.00 ( 0.00%) 269.00 ( 20.65%)
Min free-odr4-8 299.00 ( 0.00%) 239.00 ( 20.07%)
Min free-odr4-16 289.00 ( 0.00%) 234.00 ( 19.03%)
Min free-odr4-32 291.00 ( 0.00%) 235.00 ( 19.24%)
Min free-odr4-64 298.00 ( 0.00%) 238.00 ( 20.13%)
Min free-odr4-128 308.00 ( 0.00%) 251.00 ( 18.51%)
Min free-odr4-256 321.00 ( 0.00%) 267.00 ( 16.82%)
Min free-odr4-512 327.00 ( 0.00%) 269.00 ( 17.74%)
Min free-odr4-1024 326.00 ( 0.00%) 271.00 ( 16.87%)
Min total-odr0-1 644.00 ( 0.00%) 494.00 ( 23.29%)
Min total-odr0-2 466.00 ( 0.00%) 356.00 ( 23.61%)
Min total-odr0-4 376.00 ( 0.00%) 291.00 ( 22.61%)
Min total-odr0-8 328.00 ( 0.00%) 251.00 ( 23.48%)
Min total-odr0-16 304.00 ( 0.00%) 234.00 ( 23.03%)
Min total-odr0-32 289.00 ( 0.00%) 224.00 ( 22.49%)
Min total-odr0-64 282.00 ( 0.00%) 218.00 ( 22.70%)
Min total-odr0-128 280.00 ( 0.00%) 216.00 ( 22.86%)
Min total-odr0-256 310.00 ( 0.00%) 243.00 ( 21.61%)
Min total-odr0-512 329.00 ( 0.00%) 273.00 ( 17.02%)
Min total-odr0-1024 346.00 ( 0.00%) 290.00 ( 16.18%)
Min total-odr0-2048 357.00 ( 0.00%) 304.00 ( 14.85%)
Min total-odr0-4096 367.00 ( 0.00%) 315.00 ( 14.17%)
Min total-odr0-8192 372.00 ( 0.00%) 317.00 ( 14.78%)
Min total-odr0-16384 373.00 ( 0.00%) 319.00 ( 14.48%)
Min total-odr1-1 838.00 ( 0.00%) 739.00 ( 11.81%)
Min total-odr1-2 619.00 ( 0.00%) 543.00 ( 12.28%)
Min total-odr1-4 495.00 ( 0.00%) 433.00 ( 12.53%)
Min total-odr1-8 440.00 ( 0.00%) 382.00 ( 13.18%)
Min total-odr1-16 407.00 ( 0.00%) 353.00 ( 13.27%)
Min total-odr1-32 398.00 ( 0.00%) 341.00 ( 14.32%)
Min total-odr1-64 389.00 ( 0.00%) 336.00 ( 13.62%)
Min total-odr1-128 392.00 ( 0.00%) 341.00 ( 13.01%)
Min total-odr1-256 391.00 ( 0.00%) 344.00 ( 12.02%)
Min total-odr1-512 396.00 ( 0.00%) 349.00 ( 11.87%)
Min total-odr1-1024 402.00 ( 0.00%) 357.00 ( 11.19%)
Min total-odr1-2048 409.00 ( 0.00%) 364.00 ( 11.00%)
Min total-odr1-4096 414.00 ( 0.00%) 366.00 ( 11.59%)
Min total-odr1-8192 417.00 ( 0.00%) 369.00 ( 11.51%)
Min total-odr2-1 921.00 ( 0.00%) 1210.00 (-31.38%)
Min total-odr2-2 682.00 ( 0.00%) 576.00 ( 15.54%)
Min total-odr2-4 547.00 ( 0.00%) 616.00 (-12.61%)
Min total-odr2-8 483.00 ( 0.00%) 406.00 ( 15.94%)
Min total-odr2-16 449.00 ( 0.00%) 376.00 ( 16.26%)
Min total-odr2-32 437.00 ( 0.00%) 366.00 ( 16.25%)
Min total-odr2-64 431.00 ( 0.00%) 363.00 ( 15.78%)
Min total-odr2-128 433.00 ( 0.00%) 365.00 ( 15.70%)
Min total-odr2-256 434.00 ( 0.00%) 371.00 ( 14.52%)
Min total-odr2-512 446.00 ( 0.00%) 379.00 ( 15.02%)
Min total-odr2-1024 461.00 ( 0.00%) 392.00 ( 14.97%)
Min total-odr2-2048 464.00 ( 0.00%) 395.00 ( 14.87%)
Min total-odr2-4096 465.00 ( 0.00%) 398.00 ( 14.41%)
Min total-odr3-1 1060.00 ( 0.00%) 872.00 ( 17.74%)
Min total-odr3-2 778.00 ( 0.00%) 633.00 ( 18.64%)
Min total-odr3-4 632.00 ( 0.00%) 510.00 ( 19.30%)
Min total-odr3-8 565.00 ( 0.00%) 452.00 ( 20.00%)
Min total-odr3-16 524.00 ( 0.00%) 424.00 ( 19.08%)
Min total-odr3-32 514.00 ( 0.00%) 415.00 ( 19.26%)
Min total-odr3-64 515.00 ( 0.00%) 419.00 ( 18.64%)
Min total-odr3-128 523.00 ( 0.00%) 426.00 ( 18.55%)
Min total-odr3-256 541.00 ( 0.00%) 438.00 ( 19.04%)
Min total-odr3-512 559.00 ( 0.00%) 451.00 ( 19.32%)
Min total-odr3-1024 561.00 ( 0.00%) 456.00 ( 18.72%)
Min total-odr3-2048 562.00 ( 0.00%) 459.00 ( 18.33%)
Min total-odr4-1 1328.00 ( 0.00%) 1011.00 ( 23.87%)
Min total-odr4-2 997.00 ( 0.00%) 747.00 ( 25.08%)
Min total-odr4-4 813.00 ( 0.00%) 615.00 ( 24.35%)
Min total-odr4-8 721.00 ( 0.00%) 550.00 ( 23.72%)
Min total-odr4-16 689.00 ( 0.00%) 529.00 ( 23.22%)
Min total-odr4-32 683.00 ( 0.00%) 528.00 ( 22.69%)
Min total-odr4-64 692.00 ( 0.00%) 531.00 ( 23.27%)
Min total-odr4-128 713.00 ( 0.00%) 556.00 ( 22.02%)
Min total-odr4-256 738.00 ( 0.00%) 586.00 ( 20.60%)
Min total-odr4-512 753.00 ( 0.00%) 595.00 ( 20.98%)
Min total-odr4-1024 752.00 ( 0.00%) 600.00 ( 20.21%)

This patch (of 27):

order-0 pages by definition cannot be compound so avoid the check in the
fast path for those pages.

[akpm@linux-foundation.org: use unlikely(order) in free_pages_prepare(), per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3da88fb3 19-May-2016 Michal Hocko <mhocko@suse.com>

mm, oom: move GFP_NOFS check to out_of_memory

__alloc_pages_may_oom is the central place to decide when the
out_of_memory should be invoked. This is a good approach for most
checks there because they are page allocator specific and the allocation
fails right after for all of them.

The notable exception is GFP_NOFS context which is faking
did_some_progress and keep the page allocator looping even though there
couldn't have been any progress from the OOM killer. This patch doesn't
change this behavior because we are not ready to allow those allocation
requests to fail yet (and maybe we will face the reality that we will
never manage to safely fail these request). Instead __GFP_FS check is
moved down to out_of_memory and prevent from OOM victim selection there.
There are two reasons for that

- OOM notifiers might release some memory even from this context
as none of the registered notifier seems to be FS related
- this might help a dying thread to get an access to memory
reserves and move on which will make the behavior more
consistent with the case when the task gets killed from a
different context.

Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.

Note to the current oom_notifier users:

The observable difference for you is that oom notifiers cannot depend on
any fs locks because we could deadlock. Not that this would be allowed
today because that would just lockup machine in most of the cases and
ruling out the OOM killer along the way. Another difference is that
callbacks might be invoked sooner now because GFP_NOFS is a weaker
reclaim context and so there could be reclaimable memory which is just
not reachable now. That would require GFP_NOFS only loads which are
really rare and more importantly the observable result would be dropping
of reconstructible object and potential performance drop which is not
such a big deal when we are struggling to fulfill other important
allocation requests.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc2bd799 19-May-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: correct highmem memory statistics

ZONE_MOVABLE could be treated as highmem so we need to consider it for
accurate statistics. And, in following patches, ZONE_CMA will be
introduced and it can be treated as highmem, too. So, instead of
manually adding stat of ZONE_MOVABLE, looping all zones and check
whether the zone is highmem or not and add stat of the zone which can be
treated as highmem.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba6b0979 19-May-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

power: add zone range overlapping check

There is a system thats node's pfns are overlapped as follows:

-----pfn-------->
N0 N1 N2 N0 N1 N2

Therefore, we need to care this overlapping when iterating pfn range.

mark_free_pages() iterates requested zone's pfn range and unset all
range's bitmap first. And then it marks freepages in a zone to the
bitmap. If there is an overlapping zone, above unset could clear
previous marked bit and reference to this bitmap in the future will
cause the problem. To prevent it, this patch adds a zone check in
mark_free_pages().

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9eb6319 19-May-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/memory_hotplug: add comment to some functions related to memory hotplug

__offline_isolated_pages() and test_pages_isolated() are used by memory
hotplug. These functions require that range is in a single zone but
there is no code to do this because memory hotplug checks it before
calling these functions. To avoid confusing future user of these
functions, this patch adds comments to them.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 949698a3 19-May-2016 Li Zhang <zhlcindy@linux.vnet.ibm.com>

mm/page_alloc: Remove useless parameter of __free_pages_boot_core

__free_pages_boot_core has parameter pfn which is not used at all.
Remove it.

Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0139aa7b 19-May-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm: rename _count, field of the struct page, to _refcount

Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.

[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc22af74 05-May-2016 Jason Baron <jbaron@akamai.com>

mm: update min_free_kbytes from khugepaged after core initialization

Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.

Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.

The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored. I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().

This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering. On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.

Fixes: 79553da293d3 ("thp: cleanup khugepaged startup")
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9dddbf5 25-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm/page_alloc: prevent merging between isolated and other pageblocks

Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:

> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal: 16342016 kB
> MemFree: 22367268 kB
> MemAvailable: 22370528 kB

Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:

> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partialy isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
> MIGRATE_ISOLATE, respectively.
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.

This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
but since it doesn't hold the zone->lock between pageblocks, a race
window does exist.

It's also likely that unexpected merging can occur between
MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
__free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
max order of merging on isolated pageblock"). However, we only check
the migratetype of the pageblock where buddy merging has been initiated,
not the migratetype of the buddy pageblock (or group of pageblocks)
which can be MIGRATE_ISOLATE.

Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath
and bloat-o-meter has shown significant code bloat (the function is
inline).

This patch reduces the bloat at some expense of more complicated code.
The buddy-merging while-loop in __free_one_page() is initially bounded
to pageblock_border and without any migratetype checks. The checks are
placed outside, bumping the max_order if merging is allowed, and
returning to the while-loop with a statement which can't be possibly
considered harmful.

This fixes the accounting bug and also removes the arguably weird state
in the original commit 3c605096d315 where buddies could be left
unmerged.

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Link: https://lkml.org/lkml/2016/3/2/280
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Debugged-by: Laura Abbott <labbott@redhat.com>
Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a687aac 17-Mar-2016 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled

After the OOM killer is disabled during suspend operation, any
!__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any
!__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 987b3095 17-Mar-2016 Li Zhang <zhlcindy@linux.vnet.ibm.com>

mm: meminit: initialise more memory for inode/dentry hash tables in early boot

Upstream has supported page parallel initialisation for X86 and the boot
time is improved greately. Some tests have been done for Power.

Here is the result I have done with different memory size.

* 4GB memory:
boot time is as the following:
with patch vs without patch: 10.4s vs 24.5s
boot time is improved 57%
* 200GB memory:
boot time looks the same with and without patches.
boot time is about 38s
* 32TB memory:
boot time looks the same with and without patches
boot time is about 160s.
The boot time is much shorter than X86 with 24TB memory.
From community discussion, it costs about 694s for X86 24T system.

Parallel initialisation improves the performance by deferring memory
initilisation to kswap with N kthreads, it should improve the performance
therotically.

In testing on X86, performance is improved greatly with huge memory. But
on Power platform, it is improved greatly with less than 100GB memory.
For huge memory, it is not improved greatly. But it saves the time with
several threads at least, as the following information shows(32TB system
log):

[ 22.648169] node 9 initialised, 16607461 pages in 280ms
[ 22.783772] node 3 initialised, 23937243 pages in 410ms
[ 22.858877] node 6 initialised, 29179347 pages in 490ms
[ 22.863252] node 2 initialised, 29179347 pages in 490ms
[ 22.907545] node 0 initialised, 32049614 pages in 540ms
[ 22.920891] node 15 initialised, 32212280 pages in 550ms
[ 22.923236] node 4 initialised, 32306127 pages in 550ms
[ 22.923384] node 12 initialised, 32314319 pages in 550ms
[ 22.924754] node 8 initialised, 32314319 pages in 550ms
[ 22.940780] node 13 initialised, 33353677 pages in 570ms
[ 22.940796] node 11 initialised, 33353677 pages in 570ms
[ 22.941700] node 5 initialised, 33353677 pages in 570ms
[ 22.941721] node 10 initialised, 33353677 pages in 570ms
[ 22.941876] node 7 initialised, 33353677 pages in 570ms
[ 22.944946] node 14 initialised, 33353677 pages in 570ms
[ 22.946063] node 1 initialised, 33345485 pages in 580ms

It saves the time about 550*16 ms at least, although it can be ignore to
compare the boot time about 160 seconds. What's more, the boot time is
much shorter on Power even without patches than x86 for huge memory
machine.

So this patchset is still necessary to be enabled for Power.

This patch (of 2):

This patch is based on Mel Gorman's old patch in the mailing list,
https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with
a completion to wait for all memory initialised in page_alloc_init_late().
It is to fix the OOM problem on X86 with 24TB memory which allocates
memory in late initialisation. But for Power platform with 32TB memory,
it causes a call trace in vfs_caches_init->inode_init() and inode hash
table needs more memory. So this patch allocates 1GB for 0.25TB/node for
large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627

This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
Currently, it only allocates 2GB*16=32GB for early initialisation. But
Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB.
So the system have no enough memory for it. The log from dmesg as the
following:

Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
swapper/0: page allocation failure: order:0,mode:0x2080020
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
Call Trace:
.dump_stack+0xb4/0xb664 (unreliable)
.warn_alloc_failed+0x114/0x160
.__vmalloc_area_node+0x1a4/0x2b0
.__vmalloc_node_range+0xe4/0x110
.__vmalloc_node+0x40/0x50
.alloc_large_system_hash+0x134/0x2a4
.inode_init+0xa4/0xf0
.vfs_caches_init+0x80/0x144
.start_kernel+0x40c/0x4e0
start_here_common+0x20/0x4a4

Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1170532b 17-Mar-2016 Joe Perches <joe@perches.com>

mm: convert printk(KERN_<LEVEL> to pr_<level>

Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

- Realign arguments
- Add missing newline to format
- kmemleak-test.c has a "kmemleak: " prefix added to the
"Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org> [percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 756a025f 17-Mar-2016 Joe Perches <joe@perches.com>

mm: coalesce split strings

Kernel style prefers a single string over split strings when the string is
'user-visible'.

Miscellanea:

- Add a missing newline
- Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org> [percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f352e53 17-Mar-2016 Michal Hocko <mhocko@suse.com>

mm: remove __GFP_NOFAIL is deprecated comment

Commit 647757197cd3 ("mm: clarify __GFP_NOFAIL deprecation status") was
incomplete and didn't remove the comment about __GFP_NOFAIL being
deprecated in buffered_rmqueue.

Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
because we should really discourage from using __GFP_NOFAIL with higher
order allocations because those are just too subtle.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe896d18 17-Mar-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm: introduce page reference manipulation functions

The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.

In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.

In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 444eb2a4 17-Mar-2016 Mel Gorman <mgorman@techsingularity.net>

mm: thp: set THP defrag by default to madvise and add a stall-free defrag option

THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.

First, this patch defines THP defrag as follows;

madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.

Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.

This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.

After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this

usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is

4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41

Which is substantial. Now, the reclaim figures

4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0

This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used

thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload

4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1

Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available

stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting

4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98

And once again we look at the reclaim figures

4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0

Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 795ae7a0 17-Mar-2016 Johannes Weiner <hannes@cmpxchg.org>

mm: scale kswapd watermarks in proportion to memory

In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d02bd27b 17-Mar-2016 Igor Redko <redkoi@virtuozzo.com>

mm/page_alloc.c: calculate 'available' memory in a separate function

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.

In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 698b1b30 17-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: introduce kcompactd

Memory compaction can be currently performed in several contexts:

- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc

The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more

The current situation wrt the purposes has a few drawbacks:

- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.

To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.

One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.

Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.

This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.

For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.

This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.

Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd

Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.

[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 505f6d22 17-Mar-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

sound: query dynamic DEBUG_PAGEALLOC setting

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.

[akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 832fc1de 17-Mar-2016 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

/proc/kpageflags: return KPF_BUDDY for "tail" buddy pages

Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.

With this patch:

$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061

783657 pages is 3134628 kB (roughly consistent with the global counter,)
so it's OK.

[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cf91a98 15-Mar-2016 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous

There is a performance drop report due to hugepage allocation and in
there half of cpu time are spent on pageblock_pfn_to_page() in
compaction [1].

In that workload, compaction is triggered to make hugepage but most of
pageblocks are un-available for compaction due to pageblock type and
skip bit so compaction usually fails. Most costly operations in this
case is to find valid pageblock while scanning whole zone range. To
check if pageblock is valid to compact, valid pfn within pageblock is
required and we can obtain it by calling pageblock_pfn_to_page(). This
function checks whether pageblock is in a single zone and return valid
pfn if possible. Problem is that we need to check it every time before
scanning pageblock even if we re-visit it and this turns out to be very
expensive in this workload.

Although we have no way to skip this pageblock check in the system where
hole exists at arbitrary position, we can use cached value for zone
continuity and just do pfn_to_page() in the system where hole doesn't
exist. This optimization considerably speeds up in above workload.

Before vs After
Max: 1096 MB/s vs 1325 MB/s
Min: 635 MB/s 1015 MB/s
Avg: 899 MB/s 1194 MB/s

Avg is improved by roughly 30% [2].

[1]: http://www.spinics.net/lists/linux-mm/msg97378.html
[2]: https://lkml.org/lkml/2015/12/9/23

[akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1414c7f4 15-Mar-2016 Laura Abbott <labbott@fedoraproject.org>

mm/page_poisoning.c: allow for zero poisoning

By default, page poisoning uses a poison value (0xaa) on free. If this
is changed to 0, the page is not only sanitized but zeroing on alloc
with __GFP_ZERO can be skipped as well. The tradeoff is that detecting
corruption from the poisoning is harder to detect. This feature also
cannot be used with hibernation since pages are not guaranteed to be
zeroed after hibernation.

Credit to Grsecurity/PaX team for inspiring this work

Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8823b1db 15-Mar-2016 Laura Abbott <labbott@fedoraproject.org>

mm/page_poison.c: enable PAGE_POISONING as a separate option

Page poisoning is currently set up as a feature if architectures don't
have architecture debug page_alloc to allow unmapping of pages. It has
uses apart from that though. Clearing of the pages on free provides an
increase in security as it helps to limit the risk of information leaks.
Allow page poisoning to be enabled as a separate option independent of
kernel_map pages since the two features do separate work. Because of
how hiberanation is implemented, the checks on alloc cannot occur if
hibernation is enabled. The runtime alloc checks can also be enabled
with an option when !HIBERNATION.

Credit to Grsecurity/PaX team for inspiring this work

Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff8e8116 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, debug: move bad flags printing to bad_page()

Since bad_page() is the only user of the badflags parameter of
dump_page_badflags(), we can move the code to bad_page() and simplify a
bit.

The dump_page_badflags() function is renamed to __dump_page() and can
still be called separately from dump_page() for temporary debug prints
where page_owner info is not desired.

The only user-visible change is that page->mem_cgroup is printed before
the bad flags.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e462112 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_owner: dump page owner info from dump_page()

The page_owner mechanism is useful for dealing with memory leaks. By
reading /sys/kernel/debug/page_owner one can determine the stack traces
leading to allocations of all pages, and find e.g. a buggy driver.

This information might be also potentially useful for debugging, such as
the VM_BUG_ON_PAGE() calls to dump_page(). So let's print the stored
info from dump_page().

Example output:

page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
page dumped because: VM_BUG_ON_PAGE(1)
page->mem_cgroup:ffff8801392c5000
page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
[<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
[<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
[<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
[<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
[<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
[<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
page has been migrated, last migrate reason: compaction

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60f30350 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_owner: print migratetype of page and pageblock, symbolic flags

The information in /sys/kernel/debug/page_owner includes the migratetype
of the pageblock the page belongs to. This is also checked against the
page's migratetype (as declared by gfp_flags during its allocation), and
the page is reported as Fallback if its migratetype differs from the
pageblock's one. t This is somewhat misleading because in fact fallback
allocation is not the only reason why these two can differ. It also
doesn't direcly provide the page's migratetype, although it's possible
to derive that from the gfp_flags.

It's arguably better to print both page and pageblock's migratetype and
leave the interpretation to the consumer than to suggest fallback
allocation as the only possible reason. While at it, we can print the
migratetypes as string the same way as /proc/pagetypeinfo does, as some
of the numeric values depend on kernel configuration. For that, this
patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
of mm/vmstat.c to mm/page_alloc.c and exports it.

With the new format strings for flags, we can now also provide symbolic
page and gfp flags in the /sys/kernel/debug/page_owner file. This
replaces the positional printing of page flags as single letters, which
might have looked nicer, but was limited to a subset of flags, and
required the user to remember the letters.

Example page_owner entry after the patch:

Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b4058>] alloc_pages_current+0x88/0x120
[<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
[<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
[<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
[<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
[<ffffffff81160523>] generic_file_read_iter+0x453/0x760
[<ffffffff811e0d57>] __vfs_read+0xa7/0xd0

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5c990e8 15-Mar-2016 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: print symbolic gfp_flags on allocation failure

It would be useful to translate gfp_flags into string representation
when printing in case of an allocation failure, especially as the flags
have been undergoing some changes recently and the script
./scripts/gfp-translate needs a matching source version to be accurate.

Example output:

stapio: page allocation failure: order:9, mode:0x2080020(GFP_ATOMIC)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ea6eabb0 15-Mar-2016 Christian Borntraeger <borntraeger@de.ibm.com>

mm/debug_pagealloc: ask users for default setting of debug_pagealloc

Since commit 031bc5743f158 ("mm/debug-pagealloc: make debug-pagealloc
boottime configurable") CONFIG_DEBUG_PAGEALLOC is by default not adding
any page debugging.

This resulted in several unnoticed bugs, e.g.

https://lkml.kernel.org/g/<569F5E29.3090107@de.ibm.com>
or
https://lkml.kernel.org/g/<56A20F30.4050705@de.ibm.com>

as this behaviour change was not even documented in Kconfig.

Let's provide a new Kconfig symbol that allows to change the default
back to enabled, e.g. for debug kernels. This also makes the change
obvious to kernel packagers.

Let's also change the Kconfig description for CONFIG_DEBUG_PAGEALLOC, to
indicate that there are two stages of overhead.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b72d0ffb 15-Mar-2016 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: rework code layout in memmap_init_zone()

This function is getting full of weird tricks to avoid word-wrapping.
Use a goto to eliminate a tab stop then use the new space

Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 342332e6 15-Mar-2016 Taku Izumi <izumi.taku@jp.fujitsu.com>

mm/page_alloc.c: introduce kernelcore=mirror option

This patch extends existing "kernelcore" option and introduces
kernelcore=mirror option. By specifying "mirror" instead of specifying
the amount of memory, non-mirrored (non-reliable) region will be
arranged into ZONE_MOVABLE.

[akpm@linux-foundation.org: fix build with CONFIG_HAVE_MEMBLOCK_NODE_MAP=n]
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d91749c1 15-Mar-2016 Taku Izumi <izumi.taku@jp.fujitsu.com>

mm/page_alloc.c: calculate zone_start_pfn at zone_spanned_pages_in_node()

Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS
complied with UEFI spec 2.5 can notify which ranges are mirrored
(reliable) via EFI memory map. Now Linux kernel utilize its information
and allocates boot time memory from reliable region.

My requirement is:
- allocate kernel memory from mirrored region
- allocate user memory from non-mirrored region

In order to meet my requirement, ZONE_MOVABLE is useful. By arranging
non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel
allocations.

My idea is to extend existing "kernelcore" option and introduces
kernelcore=mirror option. By specifying "mirror" instead of specifying
the amount of memory, non-mirrored region will be arranged into
ZONE_MOVABLE.

Earlier discussions are at:
https://lkml.org/lkml/2015/10/9/24
https://lkml.org/lkml/2015/10/15/9
https://lkml.org/lkml/2015/11/27/18
https://lkml.org/lkml/2015/12/8/836

For example, suppose 2-nodes system with the following memory range:

node 0 [mem 0x0000000000001000-0x000000109fffffff]
node 1 [mem 0x00000010a0000000-0x000000209fffffff]
and the following ranges are marked as reliable (mirrored):
[0x0000000000000000-0x0000000100000000]
[0x0000000100000000-0x0000000180000000]
[0x0000000800000000-0x0000000880000000]
[0x00000010a0000000-0x0000001120000000]
[0x00000017a0000000-0x0000001820000000]

If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are
arranged like bellow:

- node 0:
ZONE_NORMAL : [0x0000000100000000-0x00000010a0000000]
ZONE_MOVABLE: [0x0000000180000000-0x00000010a0000000]
- node 1:
ZONE_NORMAL : [0x00000010a0000000-0x00000020a0000000]
ZONE_MOVABLE: [0x0000001120000000-0x00000020a0000000]

In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated
as absent pages, and vice versa.

This patch (of 2):

Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However zone's range is fixed at the time when
invoking zone_spanned_pages_in_node().

This patch changes how each zone->zone_start_pfn is calculated in
zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 080fe206 05-Feb-2016 Vlastimil Babka <vbabka@suse.cz>

mm, hugetlb: don't require CMA for runtime gigantic pages

Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the runtime gigantic page allocation via
alloc_contig_range(), making this support available only when CONFIG_CMA
is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
associated infrastructure, it is possible with few simple adjustments to
require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.

After this patch, alloc_contig_range() and related functions are
available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
supporting runtime gigantic pages without the CMA-specific checks in
page allocator fastpaths.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3d0a918 02-Feb-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: make split_queue per-node

Andrea Arcangeli suggested to make split queue per-node to improve
scalability. Let's do it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f16f091b 15-Jan-2016 Alexander Kuleshov <kuleshovmail@gmail.com>

mm/page_alloc.c: remove unused struct zone *z variable

Remove unused struct zone *z variable which appeared in 86051ca5eaf5
("mm: fix usemap initialization").

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b94ffdc 15-Jan-2016 Dan Williams <dan.j.williams@intel.com>

x86, mm: introduce vmem_altmap to augment vmemmap_populate()

In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a982250 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

thp: introduce deferred_split_huge_page()

Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.

Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.

It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure via shrinker interface. The page will be dropped from
list on freeing through compound page destructor.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53f9263b 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: rework mapcount accounting to enable 4k mapping of THPs

We're going to allow mapping of individual 4k pages of THP compound. It
means we need to track mapcount on per small page basis.

Straight-forward approach is to use ->_mapcount in all subpages to track
how many time this subpage is mapped with PMDs or PTEs combined. But
this is rather expensive: mapping or unmapping of a THP page with PMD
would require HPAGE_PMD_NR atomic operations instead of single we have
now.

The idea is to store separately how many times the page was mapped as
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track PTE mapcount.

We use the same approach as with compound page destructor and compound
order to store compound_mapcount: use space in first tail page,
->mapping this time.

Any time we map/unmap whole compound page (THP or hugetlb) -- we
increment/decrement compound_mapcount. When we map part of compound
page with PTE we operate on ->_mapcount of the subpage.

page_mapcount() counts both: PTE and PMD mappings of the page.

Basically, we have mapcount for a subpage spread over two counters. It
makes tricky to detect when last mapcount for a page goes away.

We introduced PageDoubleMap() for this. When we split THP PMD for the
first time and there's other PMD mapping left we offset up ->_mapcount
in all subpages by one and set PG_double_map on the compound page.
These additional references go away with last compound_mapcount.

This approach provides a way to detect when last mapcount goes away on
per small page basis without introducing new overhead for most common
cases.

[akpm@linux-foundation.org: fix typo in comment]
[mhocko@suse.com: ignore partial THP when moving task]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c290f64 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: sanitize page->mapping for tail pages

We don't define meaning of page->mapping for tail pages. Currently it's
always NULL, which can be inconsistent with head page and potentially
lead to problems.

Let's poison the pointer to catch all illigal uses.

page_rmapping(), page_mapping() and page_anon_vma() are changed to look
on head page.

The only illegal use I've caught so far is __GPF_COMP pages from sound
subsystem, mapped with PTEs. do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5020e285 14-Jan-2016 Michal Hocko <mhocko@suse.com>

mm, oom: give __GFP_NOFAIL allocations access to memory reserves

__GFP_NOFAIL is a big hammer used to ensure that the allocation request
can never fail. This is a strong requirement and as such it also
deserves a special treatment when the system is OOM. The primary
problem here is that the allocation request might have come with some
locks held and the oom victim might be blocked on the same locks. This
is basically an OOM deadlock situation.

This patch tries to reduce the risk of such a deadlocks by giving
__GFP_NOFAIL allocations a special treatment and let them dive into
memory reserves after oom killer invocation. This should help them to
make a progress and release resources they are holding. The OOM victim
should compensate for the reserves consumption.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86760a2c 14-Jan-2016 Geliang Tang <geliangtang@163.com>

mm/page_alloc.c: use list_for_each_entry in mark_free_pages()

Use list_for_each_entry instead of list_for_each + list_entry to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a16601c5 14-Jan-2016 Geliang Tang <geliangtang@163.com>

mm/page_alloc.c: use list_{first,last}_entry instead of list_entry

To make the intention clearer, use list_{first,last}_entry instead of
list_entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6ac0206b 14-Jan-2016 Mel Gorman <mgorman@techsingularity.net>

mm/page_alloc.c: remove unnecessary parameter from __rmqueue

Commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order
atomic allocations on demand") added an unnecessary and unused parameter
to __rmqueue. It was a parameter that was used in an earlier version of
the patch and then left behind. This patch cleans it up.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8d01437 14-Jan-2016 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: generalize the dirty balance reserve

The dirty balance reserve that dirty throttling has to consider is
merely memory not available to userspace allocations. There is nothing
writeback-specific about it. Generalize the name so that it's reusable
outside of that context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 33d53103 14-Jan-2016 Michal Hocko <mhocko@suse.com>

mm/page_alloc.c: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim

__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
__GFP_NOFAIL is requested. This is fragile because we are basically
relying on somebody else to make the reclaim (be it the direct reclaim
or OOM killer) for us. The caller might be holding resources (e.g.
locks) which block other other reclaimers from making any progress for
example. Remove the retry loop and rely on __alloc_pages_slowpath to
invoke all allowed reclaim steps and retry logic.

We have to be careful about __GFP_NOFAIL allocations from the
PF_MEMALLOC context even though this is a very bad idea to begin with
because no progress can be gurateed at all. We shouldn't break the
__GFP_NOFAIL semantic here though. It could be argued that this is
essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
is much harder to check for existing users because they might happen
deep down the code path performed much later after setting the flag so
we cannot really rule out there is no kernel path triggering this
combination.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fde82aaa 14-Jan-2016 Michal Hocko <mhocko@suse.com>

mm/page_alloc.c: get rid of __alloc_pages_high_priority()

__alloc_pages_high_priority doesn't do anything special other than it
calls get_page_from_freelist and loops around GFP_NOFAIL allocation
until it succeeds. It would be better if the first part was done in
__alloc_pages_slowpath where we modify the zonelist because this would
be easier to read and understand. Opencoding the function into its only
caller allows to simplify it a bit as well.

This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c00eb15a 14-Jan-2016 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

mm/zonelist: enumerate zonelists array index

Hardcoding index to zonelists array in gfp_zonelist() is not a good
idea, let's enumerate it to improve readability.

No functional change.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
[n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8ef5849f 14-Jan-2016 Joonsoo Kim <js1304@gmail.com>

mm/cma: always check which page caused allocation failure

Now, we have tracepoint in test_pages_isolated() to notify pfn which
cannot be isolated. But, in alloc_contig_range(), some error path
doesn't call test_pages_isolated() so it's still hard to know exact pfn
that causes allocation failure.

This patch change this situation by calling test_pages_isolated() in
almost error path. In allocation failure case, some overhead is added
by this change, but, allocation failure is really rare event so it would
not matter.

In fatal signal pending case, we don't call test_pages_isolated()
because this failure is intentional one.

There was a bogus outer_start problem due to unchecked buddy order and
this patch also fix it. Before this patch, it didn't matter, because
end result is same thing. But, after this patch, tracepoint will report
failed pfn so it should be accurate.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9bb7e62 14-Jan-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

memcg: only account kmem allocations marked as __GFP_ACCOUNT

Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So this patch switches kmem accounting to the white-policy: now only
those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
memcg. Currently, no kmem allocations are marked like this. The
following patches will mark several kmem allocations that are known to
be easily triggered from userspace and therefore should be accounted to
memcg.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 475a2f90 11-Dec-2015 Vlastimil Babka <vbabka@suse.cz>

mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo

Commit 016c13daa5c9 ("mm, page_alloc: use masks and shifts when
converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
wasn't updated to reflect that.

As a result, the file /proc/pagetypeinfo shows the counts for Movable as
Reclaimable and vice versa.

Additionally, commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks
for high-order atomic allocations on demand") introduced
MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
show_migration_types(), so it doesn't appear in the listing of free
areas during page alloc failures or oom kills.

This patch fixes both problems. The atomic reserves will show with a
letter 'H' in the free areas listings.

Fixes: 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b0aeba74 10-Nov-2015 Tony Luck <tony.luck@intel.com>

Fix alloc_node_mem_map() to work on ia64 again

In commit a1c34a3bf00a ("mm: Don't offset memmap for flatmem") Laura
fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM
IFC6410 board.

One small wrinkle on ia64 is that it allocates the node_mem_map earlier
in arch code, so it skips the block of code where "offset" is
initialized.

Move initialization of start and offset before the check for the
node_mem_map so that they will always be available in the latter part of
the function.

Tested-by: Laura Abbott <laura@labbott.name>
Fixes: a1c34a3bf00a (mm: Don't offset memmap for flatmem)
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d00181b9 06-Nov-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: use 'unsigned int' for page order

Let's try to be consistent about data type of page order.

[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d798ca3 06-Nov-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: make compound_head() robust

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

CPU0 CPU1

isolate_migratepages_block()
page_count()
compound_head()
!!PageTail() == true
put_page()
tail->first_page = NULL
head = tail->first_page
alloc_pages(__GFP_COMP)
prep_compound_page()
tail->first_page = head
__SetPageTail(p);
!!PageTail() == true
<head == NULL dereferencing>

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.

The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.

hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.

That means page->compound_head shares storage space with:

- page->lru.next;
- page->next;
- page->rcu_head.next;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.

page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1e61557 06-Nov-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: pack compound_dtor and compound_order into one word in struct page

The patch halves space occupied by compound_dtor and compound_order in
struct page.

For compound_order, it's trivial long -> short conversion.

For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.

This patch free up one word in tail pages for reuse. This is preparation
for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97a16fc8 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: only enforce watermarks for order-0 allocations

The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.

High-order watermarks serve a different purpose. Kswapd had no high-order
awareness before they were introduced
(https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
particularly important when there were high-order atomic requests. The
watermarks both gave kswapd awareness and made a reserve for those atomic
requests.

There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are
available and the order-0 watermarks are ok. The second is that
high-order watermark checks are expensive as the free list counts up to
the requested order must be examined.

With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable
high-order page is free.

With the patch applied, there was little difference in the allocation
failure rates as the atomic reserves are small relative to the number of
allocation attempts. The expected impact is that there will never be an
allocation failure report that shows suitable pages on the free lists.

The one potential side-effect of this is that in a vanilla kernel, the
watermark checks may have kept a free page for an atomic allocation. Now,
we are 100% relying on the HighAtomic reserves and an early allocation to
have allocated them. If the first high-order atomic allocation is after
the system is already heavily fragmented then it'll fail.

[akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0aaa29a5 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand

High-order watermark checking exists for two reasons -- kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
high-order free pages for as long as possible. This patch introduces
MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
allocations on demand and avoids using those blocks for order-0
allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
allocation request steals a pageblock but limits the total number to 1% of
the zone. Callers that speculatively abuse atomic allocations for
long-lived high-order allocations to access the reserve will quickly fail.
Note that SLUB is currently not such an abuser as it reclaims at least
once. It is possible that the pageblock stolen has few suitable
high-order pages and will need to steal again in the near future but there
would need to be strong justification to search all pageblocks for an
ideal candidate.

The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.

The watermark checks account for the reserved pageblocks when the
allocation request is not a high-order atomic allocation.

The reserved pageblocks can not be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock.
This is deliberate as the intent of the reservation is to satisfy a
limited number of atomic high-order short-lived requests if the system
requires them.

The stutter benchmark was used to evaluate this but while it was running
there was a systemtap script that randomly allocated between 1 high-order
page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
is much larger than the potential reserve and it does not attempt to be
realistic. It is intended to stress random high-order allocations from an
unknown source, show that there is a reduction in failures without
introducing an anomaly where atomic allocations are more reliable than
regular allocations. The amount of memory reserved varied throughout the
workload as reserves were created and reclaimed under memory pressure.
The allocation failures once the workload warmed up were as follows;

4.2-rc5-vanilla 70%
4.2-rc5-atomic-reserve 56%

The failure rate was also measured while building multiple kernels. The
failure rate was 14% but is 6% with this patch applied.

Overall, this is a small reduction but the reserves are small relative to
the number of allocation requests. In early versions of the patch, the
failure rate reduced by a much larger amount but that required much larger
reserves and perversely made atomic allocations seem more reliable than
regular allocations.

[yalin.wang2010@gmail.com: fix redundant check and a memory leak]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 974a786e 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove MIGRATE_RESERVE

MIGRATE_RESERVE preserves an old property of the buddy allocator that
existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
tended to remain contiguous until the only alternative was to fail the
allocation. At the time it was discovered that high-order atomic
allocations relied on this property so MIGRATE_RESERVE was introduced. A
later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
Note that this patch in isolation may look like a false regression if
someone was bisecting high-order atomic allocation failures.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f77cf4e4 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: delete the zonelist_cache

The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general
are a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost
of the watermark checking.

4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it. If stalls due
to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.

The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the latency application would not notice
the mmap latency. On a 2-node machine the results of this patch are

stutter
4.3.0-rc1 4.3.0-rc1
baseline nozlc-v4
Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)

Note that the maximum stall latency went from 24 seconds to 12 which is
still bad but an improvement. The milage varies considerably 2-node
machine on an earlier test went from 494 seconds to 47 seconds and a
4-node machine that tested an earlier version of this patch went from a
worst case stall time of 6 seconds to 67ms. The nature of the benchmark
is inherently unpredictable as it is hammering the system and the milage
will vary between machines.

There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so will not be described. However,
in at least one test the following was observed

1. Direct reclaim rates were higher. This was likely due to direct reclaim
being entered instead of the zlc disabling a zone and busy looping.
Busy looping may have the effect of allowing kswapd to make more
progress and in some cases may be better overall. If this is found then
the correct action is to put direct reclaimers to sleep on a waitqueue
and allow kswapd make forward progress. Busy looping on the zlc is even
worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to 1 as more
scanning activity also encountered more pages that could not be
immediately reclaimed

In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons

1. The test is primarily concerned with latency. The mmap attempts are also
faulted which means there are THP allocation requests. The ZLC could
cause zones to be disabled causing the process to busy loop instead
of reclaiming. This looks like elevated direct reclaim activity but
it's the correct action to take based on what processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful
THP faults is highly variable but affects the reclaim stats. It's not a
realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefitted from the busy looping then it
should be fixed by having direct reclaimers sleep on a wait queue until
woken by kswapd instead of busy looping. We had this class of problem before
when congestion_waits() with a fixed timeout was a brain damaged decision
but happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71baba4b 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM

__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0164adc 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9ab0c4f 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary recalculations for dirty zone balancing

File-backed pages that will be immediately written are balanced between
zones. This heuristic tries to avoid having a single zone filled with
recently dirtied pages but the checks are unnecessarily expensive. Move
consider_zone_balanced into the alloc_context instead of checking bitmaps
multiple times. The patch also gives the parameter a more meaningful
name.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2b19197 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe

Overall, the intent of this series is to remove the zonelist cache which
was introduced to avoid high overhead in the page allocator. Once this is
done, it is necessary to reduce the cost of watermark checks.

The series starts with minor micro-optimisations.

Next it notes that GFP flags that affect watermark checks are abused.
__GFP_WAIT historically identified callers that could not sleep and could
access reserves. This was later abused to identify callers that simply
prefer to avoid sleeping and have other options. A patch distinguishes
between atomic callers, high-priority callers and those that simply wish
to avoid sleep.

The zonelist cache has been around for a long time but it is of dubious
merit with a lot of complexity and some issues that are explained. The
most important issue is that a failed THP allocation can cause a zone to
be treated as "full". This potentially causes unnecessary stalls, reclaim
activity or remote fallbacks. The issues could be fixed but it's not
worth it. The series places a small number of other micro-optimisations
on top before examining GFP flags watermarks.

High-order watermarks enforcement can cause high-order allocations to fail
even though pages are free. The watermark checks both protect high-order
atomic allocations and make kswapd aware of high-order pages but there is
a much better way that can be handled using migrate types. This series
uses page grouping by mobility to reserve pageblocks for high-order
allocations with the size of the reservation depending on demand. kswapd
awareness is maintained by examining the free lists. By patch 12 in this
series, there are no high-order watermark checks while preserving the
properties that motivated the introduction of the watermark checks.

This patch (of 10):

No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d05e83a6 05-Nov-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

memcg: simplify charging kmem pages

Charging kmem pages proceeds in two steps. First, we try to charge the
allocation size to the memcg the current task belongs to, then we allocate
a page and "commit" the charge storing the pointer to the memcg in the
page struct.

Such a design looks overcomplicated, because there is not much sense in
trying charging the allocation before actually allocating a page: we won't
be able to consume much memory over the limit even if we charge after
doing the actual allocation, besides we already charge user pages post
factum, so being pedantic with kmem pages just looks pointless.

So this patch simplifies the design by merging the "charge" and the
"commit" steps into the same function, which takes the allocated page.

Also, rename the charge and uncharge methods to memcg_kmem_charge and
memcg_kmem_uncharge and make the charge method return error code instead
of bool to conform to mem_cgroup_try_charge.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bde304bd 05-Nov-2015 Xishi Qiu <qiuxishi@huawei.com>

mm/page_alloc.c: skip ZONE_MOVABLE if required_kernelcore is larger than totalpages

If kernelcore was not specified, or the kernelcore size is zero
(required_movablecore >= totalpages), or the kernelcore size is larger
than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
both kernel memory and movable memory.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1c34a3b 05-Nov-2015 Laura Abbott <laura@labbott.name>

mm: Don't offset memmap for flatmem

Srinivas Kandagatla reported bad page messages when trying to remove the
bottom 2MB on an ARM based IFC6410 board

BUG: Bad page state in process swapper pfn:fffa8
page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x200041(locked|active|mlocked)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
Hardware name: Qualcomm (Flattened Device Tree)
unwind_backtrace
show_stack
dump_stack
bad_page
free_pages_prepare
free_hot_cold_page
__free_pages
free_highmem_page
mem_init
start_kernel
Disabling lock debugging due to kernel taint

Removing the lower 2MB made the start of the lowmem zone to no longer be
page block aligned. IFC6410 uses CONFIG_FLATMEM where alloc_node_mem_map
allocates memory for the mem_map. alloc_node_mem_map will offset for
unaligned nodes with the assumption the pfn/page translation functions
will account for the offset. The functions for CONFIG_FLATMEM do not
offset however, resulting in overrunning the memmap array. Just use the
allocated memmap without any offset when running with CONFIG_FLATMEM to
avoid the overrun.

Signed-off-by: Laura Abbott <laura@labbott.name>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Reported-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Tested-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
Cc: Santosh Shilimkar <ssantosh@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Arnd Bergman <arnd@arndb.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Andy Gross <agross@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9fd745d4 05-Nov-2015 Xishi Qiu <qiuxishi@huawei.com>

mm: fix overflow in find_zone_movable_pfns_for_nodes()

If the user set "movablecore=xx" to a large number, corepages will
overflow. Fix the problem.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b171e409 05-Nov-2015 Yaowei Bai <bywxiaobai@163.com>

mm/page_alloc: remove unused parameter in init_currently_empty_zone()

Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime by introducing an 'enum memmap_context'
parameter to memmap_init_zone() and init_currently_empty_zone(). This
parameter is intended to be used to tell whether the call of these two
functions is being made on behalf of a hotplug event, or happening at
boot-time. However, init_currently_empty_zone() does not use this
parameter at all, so remove it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 621a5f7a 26-Sep-2015 Viresh Kumar <viresh.kumar@linaro.org>

debugfs: Pass bool pointer to debugfs_create_bool()

Its a bit odd that debugfs_create_bool() takes 'u32 *' as an argument,
when all it needs is a boolean pointer.

It would be better to update this API to make it accept 'bool *'
instead, as that will make it more consistent and often more convenient.
Over that bool takes just a byte.

That required updates to all user sites as well, in the same commit
updating the API. regmap core was also using
debugfs_{read|write}_file_bool(), directly and variable types were
updated for that to be bool as well.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4ada0c5a 08-Sep-2015 Zhen Lei <thunder.leizhen@huawei.com>

mm/page_alloc.c: fix type information of memoryless node

For a memoryless node, the output of get_pfn_range_for_nid are all zero.
It will display mem from 0 to -1.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5685e92 08-Sep-2015 Xishi Qiu <qiuxishi@huawei.com>

memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()

When hot adding a node from add_memory(), we will add memblock first, so
the node is not empty. But when called from cpu_up(), the node should
be empty.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>\
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34b10060 08-Sep-2015 Yaowei Bai <bywxiaobai@163.com>

mm/page_alloc.c: change sysctl_lower_zone_reserve_ratio to sysctl_lowmem_reserve_ratio in comments

We use sysctl_lowmem_reserve_ratio rather than
sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
is in defending lowmem from the possibility of being captured into
pinned user memory. To avoid misleading, correct it in some comments.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 013110a7 08-Sep-2015 Yaowei Bai <bywxiaobai@163.com>

mm/page_alloc.c: fix a misleading comment

The comment says that the per-cpu batchsize and zone watermarks are
determined by present_pages which is definitely wrong, they are both
calculated from managed_pages. Fix it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96db800f 08-Sep-2015 Vlastimil Babka <vbabka@suse.cz>

mm: rename alloc_pages_exact_node() to __alloc_pages_node()

alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.

The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").

Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.

Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.

Both differences will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb14c2c7 08-Sep-2015 Vlastimil Babka <vbabka@suse.cz>

mm: rename and move get/set_freepage_migratetype

The pair of get/set_freepage_migratetype() functions are used to cache
pageblock migratetype for a page put on a pcplist, so that it does not
have to be retrieved again when the page is put on a free list (e.g.
when pcplists become full). Historically it was also assumed that the
value is accurate for pages on freelists (as the functions' names
unfortunately suggest), but that cannot be guaranteed without affecting
various allocator fast paths. It is in fact not needed and all such
uses have been removed.

The last remaining (but pointless) usage related to pages of freelists
is in move_freepages(), which this patch removes.

To prevent further confusion, rename the functions to
get/set_pcppage_migratetype() and expand their description. Since all
the users are now in mm/page_alloc.c, move the functions there from the
shared header.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Seungho Park <seungho1.park@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa016d14 08-Sep-2015 Vlastimil Babka <vbabka@suse.cz>

mm, page_isolation: remove bogus tests for isolated pages

The __test_page_isolated_in_pageblock() is used to verify whether all
pages in pageblock were either successfully isolated, or are hwpoisoned.
Two of the possible state of pages, that are tested, are however bogus
and misleading.

Both tests rely on get_freepage_migratetype(page), which however has no
guarantees about pages on freelists. Specifically, it doesn't guarantee
that the migratetype returned by the function actually matches the
migratetype of the freelist that the page is on. Such guarantee is not
its purpose and would have negative impact on allocator performance.

The first test checks whether the freepage_migratetype equals
MIGRATE_ISOLATE, supposedly to catch races between page isolation and
allocator activity. These races should be fixed nowadays with
51bb1a4093 ("mm/page_alloc: add freepage on isolate pageblock to correct
buddy list") and related patches. As explained above, the check
wouldn't be able to catch them reliably anyway. For the same reason
false positives can happen, although they are harmless, as the
move_freepages() call would just move the page to the same freelist it's
already on. So removing the test is not a bug fix, just cleanup. After
this patch, we assume that all PageBuddy pages are on the correct
freelist and that the races were really fixed. A truly reliable
verification in the form of e.g. VM_BUG_ON() would be complicated and
is arguably not needed.

The second test (page_count(page) == 0 && get_freepage_migratetype(page)
== MIGRATE_ISOLATE) is probably supposed (the code comes from a big
memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE
pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages
nowadays, those are freed directly to free lists, so the check is
obsolete. Remove it as well.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Seungho Park <seungho1.park@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54e9e291 08-Sep-2015 David Rientjes <rientjes@google.com>

mm, oom: pass an oom order of -1 when triggered by sysrq

The force_kill member of struct oom_control isn't needed if an order of -1
is used instead. This is the same as order == -1 in struct
compact_control which requires full memory compaction.

This patch introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e0fc46d 08-Sep-2015 David Rientjes <rientjes@google.com>

mm, oom: organize oom context into struct

There are essential elements to an oom context that are passed around to
multiple functions.

Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.

This patch introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f3eb55b 08-Sep-2015 Wei Yang <weiyang@linux.vnet.ibm.com>

mm/page_alloc.c: remove unused variable in free_area_init_core()

Commit febd5949e134 ("mm/memory hotplug: init the zone's size when
calculating node totalpages") refines the function
free_area_init_core().

After doing so, these two parameters are not used anymore.

This patch removes these two parameters.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 904a9553 08-Sep-2015 Wei Yang <weiyang@linux.vnet.ibm.com>

mm/page_alloc.c: refine the calculation of highest possible node id

nr_node_ids records the highest possible node id, which is calculated by
scanning the bitmap node_states[N_POSSIBLE]. Current implementation
scan the bitmap from the beginning, which will scan the whole bitmap.

This patch reverses the order by scanning from the end with
find_last_bit().

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 033fbae9 09-Aug-2015 Dan Williams <dan.j.williams@intel.com>

mm: ZONE_DEVICE for "device memory"

While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 2f064f34 21-Aug-2015 Michal Hocko <mhocko@suse.com>

mm: make page pfmemalloc check more robust

Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():

if (page->pfmemalloc && !page->mapping)
skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted. However, __delete_from_page_cache() can set set page->mapping
to NULL and leave page->index value alone. Due to being in union, a
non-zero page->index will be interpreted as true page->pfmemalloc.

So the assumption is invalid if the networking code can see such a page.
And it seems it can. We have encountered this with a NFS over loopback
setup when such a page is attached to a new skbuf. There is no copying
going on in this case so the page confuses __skb_fill_page_desc which
interprets the index as pfmemalloc flag and the network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping which is the case here and
that leads to hangs when the nfs client waits for a response from the
server which has been dropped and thus never arrive.

The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead. We can reuse the index
again but define it to an impossible value (-1UL). This is the page
index so it should never see the value that large. Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.

The information will get lost if somebody wants to use page->index
obviously but that was the case before and the original code expected
that the information should be persisted somewhere else if that is
really needed (e.g. what SLAB and SLUB do).

[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f9126ab9 14-Aug-2015 Xishi Qiu <qiuxishi@huawei.com>

memory-hotplug: fix wrong edge when hot add a new node

When we add a new node, the edge of memory may be wrong.

e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G],

1. hotremove the node3,
2. then hotadd node3 with a part of memory, mem:[26G-30G],
3. call hotadd_new_pgdat()
free_area_init_node()
get_pfn_range_for_nid()
4. it will return wrong start_pfn and end_pfn, because we have not
update the memblock.

This patch also fixes a BUG_ON during hot-addition, please see
http://marc.info/?l=linux-kernel&m=142961156129456&w=2

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4c18e6f 06-Aug-2015 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*

The race condition addressed in commit add05cecef80 ("mm: soft-offline:
don't free target page in successful page migration") was not closed
completely, because that can happen not only for soft-offline, but also
for hard-offline. Consider that a slab page is about to be freed into
buddy pool, and then an uncorrected memory error hits the page just
after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
necessary because the data on the affected page is not consumed.

To solve it, this patch drops __PG_HWPOISON from page flag checks at
allocation/free time. I think it's justified because __PG_HWPOISON
flags is defined to prevent the page from being reused, and setting it
outside the page's alloc-free cycle is a designed behavior (not a bug.)

For recent months, I was annoyed about BUG_ON when soft-offlined page
remains on lru cache list for a while, which is avoided by calling
put_page() instead of putback_lru_page() in page migration's success
path. This means that this patch reverts a major change from commit
add05cecef80 about the new refcounting rule of soft-offlined pages, so
"reuse window" revives. This will be closed by a subsequent patch.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4248b0da 06-Aug-2015 Mel Gorman <mgorman@suse.de>

fs, file table: reinit files_stat.max_files after deferred memory initialisation

Dave Hansen reported the following;

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3cd131d 06-Aug-2015 Nicolai Stange <nicstange@gmail.com>

mm, meminit: replace rwsem with completion

Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
before basic setup") introduced a rwsem to signal completion of the
initialization workers.

Lockdep complains about possible recursive locking:
=============================================
[ INFO: possible recursive locking detected ]
4.1.0-12802-g1dc51b8 #3 Not tainted
---------------------------------------------
swapper/0/1 is trying to acquire lock:
(pgdat_init_rwsem){++++.+},
at: [<ffffffff8424c7fb>] page_alloc_init_late+0xc7/0xe6

but task is already holding lock:
(pgdat_init_rwsem){++++.+},
at: [<ffffffff8424c772>] page_alloc_init_late+0x3e/0xe6

Replace the rwsem by a completion together with an atomic
"outstanding work counter".

[peterz@infradead.org: Barrier removal on the grounds of being pointless]
[mgorman@suse.de: Applied review feedback]
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ace9917 06-Aug-2015 Mel Gorman <mgorman@suse.de>

mm, meminit: allow early_pfn_to_nid to be used during runtime

early_pfn_to_nid() historically was inherently not SMP safe but only
used during boot which is inherently single threaded or during hotplug
which is protected by a giant mutex.

With deferred memory initialisation there was a thread-safe version
introduced and the early_pfn_to_nid would trigger a BUG_ON if used
unsafely. Memory hotplug hit that check. This patch makes
early_pfn_to_nid introduces a lock to make it safe to use during
hotplug.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Alex Ng <alexng@microsoft.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2cfc911 17-Jul-2015 Joonsoo Kim <js1304@gmail.com>

mm/page_owner: set correct gfp_mask on page_owner

Currently, we set wrong gfp_mask to page_owner info in case of isolated
freepage by compaction and split page. It causes incorrect mixed
pageblock report that we can get from '/proc/pagetypeinfo'. This metric
is really useful to measure fragmentation effect so should be accurate.
This patch fixes it by setting correct information.

Without this patch, after kernel build workload is finished, number of
mixed pageblock is 112 among roughly 210 movable pageblocks.

But, with this fix, output shows that mixed pageblock is just 57.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f3a14ced 17-Jul-2015 Joonsoo Kim <js1304@gmail.com>

mm/page_owner: fix possible access violation

When I tested my new patches, I found that page pointer which is used
for setting page_owner information is changed. This is because page
pointer is used to set new migratetype in loop. After this work, page
pointer could be out of bound. If this wrong pointer is used for
page_owner, access violation happens. Below is error message that I
got.

BUG: unable to handle kernel paging request at 0000000000b00018
IP: [<ffffffff81025f30>] save_stack_address+0x30/0x40
PGD 1af2d067 PUD 166e0067 PMD 0
Oops: 0002 [#1] SMP
...snip...
Call Trace:
print_context_stack+0xcf/0x100
dump_trace+0x15f/0x320
save_stack_trace+0x2f/0x50
__set_page_owner+0x46/0x70
__isolate_free_page+0x1f7/0x210
split_free_page+0x21/0xb0
isolate_freepages_block+0x1e2/0x410
compaction_alloc+0x22d/0x2d0
migrate_pages+0x289/0x8b0
compact_zone+0x409/0x880
compact_zone_order+0x6d/0x90
try_to_compact_pages+0x110/0x210
__alloc_pages_direct_compact+0x3d/0xe6
__alloc_pages_nodemask+0x6cd/0x9a0
alloc_pages_current+0x91/0x100
runtest_store+0x296/0xa50
simple_attr_write+0xbd/0xe0
__vfs_write+0x28/0xf0
vfs_write+0xa9/0x1b0
SyS_write+0x46/0xb0
system_call_fastpath+0x16/0x75

This patch fixes this error by moving up set_page_owner().

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae026b2a 17-Jul-2015 Mel Gorman <mgorman@suse.de>

mm, meminit: suppress unused memory variable warning

The kbuild test robot reported the following

tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: 14a6f1989dae9445d4532941bdd6bbad84f4c8da
commit: 3b242c66ccbd60cf47ab0e8992119d9617548c23 x86: mm: enable deferred struct page initialisation on x86-64
date: 3 days ago
config: x86_64-randconfig-x006-201527 (attached as .config)
reproduce:
git checkout 3b242c66ccbd60cf47ab0e8992119d9617548c23
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

mm/page_alloc.c: In function 'early_page_uninitialised':
>> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
int nid = early_pfn_to_nid(pfn);

It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
configurations. This patch avoids the warning by not declaring nid.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e1cc95b 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: finish initialisation of struct pages before basic setup

Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise
memory on demand but it interferes with page allocator paths. This patch
creates dedicated threads to initialise memory before basic setup. It
then blocks on a rw_semaphore until completion as a wait_queue and counter
is overkill. This may be slower to boot but it's simplier overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.

[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Waiman Long <waiman.long@hp.com
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Scott Norton <scott.norton@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74033a79 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: remove mminit_verify_page_links

mminit_verify_page_links() is an extremely paranoid check that was
introduced when memory initialisation was being heavily reworked.
Profiles indicated that up to 10% of parallel memory initialisation was
spent on checking this for every page. The cost could be reduced but in
practice this check only found problems very early during the
initialisation rewrite and has found nothing since. This patch removes an
expensive unnecessary check.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac5d2539 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: reduce number of times pageblocks are set during struct page init

During parallel sturct page initialisation, ranges are checked for every
PFN unnecessarily which increases boot times. This patch alters when the
ranges are checked.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4de83dd 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: free pages in large chunks where possible

Parallel struct page frees pages one at a time. Try free pages as single
large pages where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54608c3f 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: minimise number of pfn->page lookups during initialisation

Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e18adb4 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: initialise remaining struct pages in parallel with kswapd

Only a subset of struct pages are initialised at the moment. When this
patch is applied kswapd initialise the remaining struct pages in parallel.

This should boot faster by spreading the work to multiple CPUs and
initialising data that is local to the CPU. The user-visible effect on
large machines is that free memory will appear to rapidly increase early
in the lifetime of the system until kswapd reports that all memory is
initialised in the kernel log. Once initialised there should be no other
user-visibile effects.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a80a7fa 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

This patch initalises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
but will be available in a later patch. Parallel initialisation of struct
page depends on some features from memory hotplug and it is necessary to
alter alter section annotations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75a592a4 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: inline some helper functions

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as
unnecessary visibility, it's unnecessary function call overhead when
initialising pages. This patch moves the helpers inline.

[akpm@linux-foundation.org: fix build]
[mhocko@suse.cz: fix build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a942fde 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: meminit: make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

__early_pfn_to_nid() use static variables to cache recent lookups as
memblock lookups are very expensive but it assumes that memory
initialisation is single-threaded. Parallel initialisation of struct
pages will break that assumption so this patch makes __early_pfn_to_nid()
SMP-safe by requiring the caller to cache recent search information.
early_pfn_to_nid() keeps the same interface but is only safe to use early
in boot due to the use of a global static variable. meminit_pfn_in_nid()
is an SMP-safe version that callers must maintain their own state for.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d70ddd7a 30-Jun-2015 Mel Gorman <mgorman@suse.de>

mm: page_alloc: pass PFN to __free_pages_bootmem

__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation
of struct pages defers initialisation and __free_pages_bootmem can be
called for struct pages that cannot yet map struct page to PFN. This
patch passes PFN to __free_pages_bootmem with no other functional change.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 92923ca3 30-Jun-2015 Nathan Zimmer <nzimmer@sgi.com>

mm: meminit: only set page reserved in the memblock region

Currently each page struct is set as reserved upon initialization. This
patch leaves the reserved bit clear and only sets the reserved bit when it
is known the memory was allocated by the bootmem allocator. This makes it
easier to distinguish between uninitialised struct pages and reserved
struct pages in later patches.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e8ce83c 30-Jun-2015 Robin Holt <holt@sgi.com>

mm: meminit: move page initialization into a separate function

Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and
so this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9083905a 24-Jun-2015 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: inline should_alloc_retry()

The should_alloc_retry() function was meant to encapsulate retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer progress. The physical
separation of those conditions make the code hard to follow.

Inline the should_alloc_retry() checks. Notes:

- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
replace it with looping on OOM killer progress

- The pm_suspended_storage() check is meant to skip the OOM killer
when reclaim has no IO available, move to __alloc_pages_may_oom()

- The order <= PAGE_ALLOC_COSTLY order is re-united with its original
counterpart of checking whether reclaim actually made any progress

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc56401f 24-Jun-2015 Johannes Weiner <hannes@cmpxchg.org>

mm: oom_kill: simplify OOM killer locking

The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain. Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path, there is really no
reason to optimize for highly performant and concurrent OOM kills. And
the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# febd5949 24-Jun-2015 Gu Zheng <guz.fnst@cn.fujitsu.com>

mm/memory hotplug: init the zone's size when calculating node totalpages

Init the zone's size when calculating node totalpages to avoid duplicated
operations in free_area_init_core().

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4d2897b 24-Jun-2015 Anisse Astier <anisse@astier.eu>

mm/page_alloc.c: cleanup obsolete KM_USER*

It's been five years now that KM_* kmap flags have been removed and that
we can call clear_highpage from any context. So we remove prep_zero_pages
accordingly.

Signed-off-by: Anisse Astier <anisse@astier.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9919c79 24-Jun-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

mm: only define hashdist variable when needed

For !CONFIG_NUMA, hashdist will always be 0, since it's setter is
otherwise compiled out. So we can save 4 bytes of data and some .text
(although mostly in __init functions) by only defining it for
CONFIG_NUMA.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b63ae8ca 06-May-2015 Alexander Duyck <alexander.h.duyck@redhat.com>

mm/net: Rename and move page fragment handling from net/ to mm/

This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.

Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.

I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4db0c3c2 15-Apr-2015 Jason Low <jason.low2@hp.com>

mm: remove rest of ACCESS_ONCE() usages

We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE, and use the new
READ_ONCE API for the read accesses. This makes things cleaner, instead
of using separate/multiple sets of APIs.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42ff2703 14-Apr-2015 Yaowei Bai <bywxiaobai@163.com>

mm/page_alloc.c: clean up comment

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4167e9b2 14-Apr-2015 David Rientjes <rientjes@google.com>

mm: remove GFP_THISNODE

NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected. It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
allocator slowpath to fail without trying reclaim even though it may be
used in combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE. The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a
no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity. Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should. It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 761b0677 14-Apr-2015 Konstantin Khlebnikov <koct9i@gmail.com>

mm: completely remove dumping per-cpu lists from show_mem()

It seems nobody needs this.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1bfcdb8 14-Apr-2015 Konstantin Khlebnikov <koct9i@gmail.com>

mm: hide per-cpu lists in output of show_mem()

This makes show_mem() much less verbose on huge machines. Instead of huge
and almost useless dump of counters for each per-zone per-cpu lists this
patch prints the sum of these counters for each zone (free_pcp) and size
of per-cpu list for current cpu (local_pcp).

The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.

[akpm@linux-foundation.org: update show_free_areas comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2149cdae 14-Apr-2015 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/compaction: enhance compaction finish condition

Compaction has anti fragmentation algorithm. It is that freepage should
be more than pageblock order to finish the compaction if we don't find any
freepage in requested migratetype buddy list. This is for mitigating
fragmentation, but, there is a lack of migratetype consideration and it is
too excessive compared to page allocator's anti fragmentation algorithm.

Not considering migratetype would cause premature finish of compaction.
For example, if allocation request is for unmovable migratetype, freepage
with CMA migratetype doesn't help that allocation and compaction should
not be stopped. But, current logic regards this situation as compaction
is no longer needed, so finish the compaction.

Secondly, condition is too excessive compared to page allocator's logic.
We can steal freepage from other migratetype and change pageblock
migratetype on more relaxed conditions in page allocator. This is
designed to prevent fragmentation and we can use it here. Imposing hard
constraint only to the compaction doesn't help much in this case since
page allocator would cause fragmentation again.

To solve these problems, this patch borrows anti fragmentation logic from
page allocator. It will reduce premature compaction finish in some cases
and reduce excessive compaction work.

stress-highalloc test in mmtests with non movable order 7 allocation shows
considerable increase of compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
31.82 : 42.20

I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
there is no more degradation on allocation success rate than before. That
roughly means that this patch doesn't result in more fragmentations.

Vlastimil suggests additional idea that we only test for fallbacks when
migration scanner has scanned a whole pageblock. It looked good for
fragmentation because chance of stealing increase due to making more free
pages in certain pageblock. So, I tested it, but, it results in decreased
compaction success rate, roughly 38.00. I guess the reason that if system
is low memory condition, watermark check could be failed due to not enough
order 0 free page and so, sometimes, we can't reach a fallback check
although migrate_pfn is aligned to pageblock_nr_pages. I can insert code
to cope with this situation but it makes code more complicated so I don't
include his idea at this patch.

[akpm@linux-foundation.org: fix CONFIG_CMA=n build]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4eb7dce6 14-Apr-2015 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: factor out fallback freepage checking

This is preparation step to use page allocator's anti fragmentation logic
in compaction. This patch just separates fallback freepage checking part
from fallback freepage management part. Therefore, there is no functional
change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc67647b 14-Apr-2015 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/cma: change fallback behaviour for CMA freepage

Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they
should not be expanded to other migratetype buddy list to protect them
from unmovable/reclaimable allocation. Implementing these requirements in
__rmqueue_fallback(), that is, finding largest possible block of freepage
has bad effect that high order freepage with MIGRATE_CMA are broken
continually although there are suitable order CMA freepage. Reason is
that they are not be expanded to other migratetype buddy list and next
__rmqueue_fallback() invocation try to finds another largest block of
freepage and break it again. So, MIGRATE_CMA fallback should be handled
separately. This patch introduces __rmqueue_cma_fallback(), that just
wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if
migratetype == MIGRATE_MOVABLE.

This results in unintended behaviour change that MIGRATE_CMA freepage is
always used first rather than other migratetype as movable allocation's
fallback. But, as already mentioned above, MIGRATE_CMA can be used only
for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as
much as possible. Otherwise, we needlessly take up precious freepages
with other migratetype and increase chance of fragmentation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e009d5dc 12-Mar-2015 Michal Hocko <mhocko@suse.cz>

mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled

Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
after OOM killer is disabled if the allocation is performed by a kernel
thread. This behavior was introduced from the very beginning by
7f33d49a2ed5 ("mm, PM/Freezer: Disable OOM killer when tasks are frozen").
This means that the basic contract for the allocation request is broken
and the context requesting such an allocation might blow up unexpectedly.

There are basically two ways forward.

1) move oom_killer_disable after kernel threads are frozen. This has a
risk that the OOM victim wouldn't be able to finish because it would
depend on an already frozen kernel thread. This would be really tricky
to debug.

2) do not fail GFP_NOFAIL allocation no matter what and risk a
potential Freezable kernel threads will loop and fail the suspend.
Incidental allocations after kernel threads are frozen will at least
dump a warning - if we are lucky and the serial console is still active
of course...

This patch implements the later option because it is safer. We would see
warning rather than allocation failures for the kernel threads which would
blow up otherwise and have a higher chances to identify __GFP_NOFAIL users
from deeper pm code.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@gooogle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc873177 27-Feb-2015 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change

Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point. This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8c73fc2 13-Feb-2015 Andrey Ryabinin <ryabinin.a.a@gmail.com>

mm: page_alloc: add kasan hooks on alloc and free paths

Add kernel address sanitizer hooks to mark allocated page's addresses as
accessible in corresponding shadow region. Mark freed pages as
inaccessible.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84109e15 12-Feb-2015 Yaowei Bai <bywxiaobai@163.com>

mm/page_alloc: fix comment

Add a necessary 'leave'.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 061f67bc 12-Feb-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

mm/page_alloc.c: pull out init code from build_all_zonelists

Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
its own helper allows us to shrink .text a little. This relies on
build_all_zonelists already having a __ref annotation. Add a comment
explaining why so one doesn't have to track it down through git log.

The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist
as __init"), where we save about 400 bytes

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Cc: Pintu Kumar <pintu.k@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c0415eb 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm: more aggressive page stealing for UNMOVABLE allocations

When allocation falls back to stealing free pages of another migratetype,
it can decide to steal extra pages, or even the whole pageblock in order
to reduce fragmentation, which could happen if further allocation
fallbacks pick a different pageblock. In try_to_steal_freepages(), one of
the situations where extra pages are stolen happens when we are trying to
allocate a MIGRATE_RECLAIMABLE page.

However, MIGRATE_UNMOVABLE allocations are not treated the same way,
although spreading such allocation over multiple fallback pageblocks is
arguably even worse than it is for RECLAIMABLE allocations. To minimize
fragmentation, we should minimize the number of such fallbacks, and thus
steal as much as is possible from each fallback pageblock.

Note that in theory this might put more pressure on movable pageblocks and
cause movable allocations to steal back from unmovable pageblocks.
However, movable allocations are not as aggressive with stealing, and do
not cause permanent fragmentation, so the tradeoff is reasonable, and
evaluation seems to support the change.

This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
steal extra free pages. When evaluating with stress-highalloc from
mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
roughly 1/6. The number of these fallbacks stealing from MIGRATE_MOVABLE
block is reduced to 1/3. There was no observation of growing number of
unmovable pageblocks over time, and also not of increased movable
allocation fallbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a1086fb 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm: always steal split buddies in fallback allocations

When allocation falls back to another migratetype, it will steal a page
with highest available order, and (depending on this order and desired
migratetype), it might also steal the rest of free pages from the same
pageblock.

Given the preference of highest available order, it is likely that it will
be higher than the desired order, and result in the stolen buddy page
being split. The remaining pages after split are currently stolen only
when the rest of the free pages are stolen. This can however lead to
situations where for MOVABLE allocations we split e.g. order-4 fallback
UNMOVABLE page, but steal only order-0 page. Then on the next MOVABLE
allocation (which may be batched to fill the pcplists) we split another
order-3 or higher page, etc. By stealing all pages that we have split, we
can avoid further stealing.

This patch therefore adjusts the page stealing so that buddy pages created
by split are always stolen. This has effect only on MOVABLE allocations,
as RECLAIMABLE and UNMOVABLE allocations already always do that in
addition to stealing the rest of free pages from the pageblock. The
change also allows to simplify try_to_steal_freepages() and factor out CMA
handling.

According to Mel, it has been intended since the beginning that buddy
pages after split would be stolen always, but it doesn't seem like it was
ever the case until commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA
migration type added"). The commit has unintentionally introduced this
behavior, but was reverted by commit 0cbef29a7821 ("mm:
__rmqueue_fallback() should respect pageblock type"). Neither included
evaluation.

My evaluation with stress-highalloc from mmtests shows about 2.5x
reduction of page stealing events for MOVABLE allocations, without
affecting the page stealing events for other allocation migratetypes.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99592d59 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm: when stealing freepages, also take pages created by splitting buddy page

When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages(). The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.

Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks. Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.

The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results. Here I used different baseline which includes RFC compaction
improvements from [1]. I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.

First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test. First
column is after fresh reboot, other two are reiterations of test without
reboot. That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).

Baseline:

3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Page alloc extfrag event 10264225 8702233 10244125
Extfrag fragmenting 10263271 8701552 10243473
Extfrag fragmenting for unmovable 13595 17616 15960
Extfrag fragmenting unmovable placed with movable 7989 12193 8447
Extfrag fragmenting for reclaimable 658 1840 1817
Extfrag fragmenting reclaimable placed with movable 558 1677 1679
Extfrag fragmenting for movable 10249018 8682096 10225696

With Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Page alloc extfrag event 11834954 9877523 9774860
Extfrag fragmenting 11833993 9876880 9774245
Extfrag fragmenting for unmovable 7342 16129 11712
Extfrag fragmenting unmovable placed with movable 4191 10547 6270
Extfrag fragmenting for reclaimable 373 1130 923
Extfrag fragmenting reclaimable placed with movable 302 906 738
Extfrag fragmenting for movable 11826278 9859621 9761610

With Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Page alloc extfrag event 4725990 3668793 3807436
Extfrag fragmenting 4725104 3668252 3806898
Extfrag fragmenting for unmovable 6678 7974 7281
Extfrag fragmenting unmovable placed with movable 2051 3829 4017
Extfrag fragmenting for reclaimable 429 1208 1278
Extfrag fragmenting reclaimable placed with movable 369 976 1034
Extfrag fragmenting for movable 4717997 3659070 3798339

With Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Page alloc extfrag event 5016183 4700142 3850633
Extfrag fragmenting 5015325 4699613 3850072
Extfrag fragmenting for unmovable 1312 3154 3088
Extfrag fragmenting unmovable placed with movable 1115 2777 2714
Extfrag fragmenting for reclaimable 437 1193 1097
Extfrag fragmenting reclaimable placed with movable 330 969 879
Extfrag fragmenting for movable 5013576 4695266 3845887

In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise. Here, each patch
improves the situation for unmovable events. Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO. The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless. These are least critical as
compaction can move them around.

If we look at success rates, the patches don't affect them, that didn't change.

Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%)
Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%)
Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%)
Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%)
Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%)
Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%)
Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%)
Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)

Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%)
Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%)
Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%)
Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%)
Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%)
Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%)
Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%)

Patch 2:

3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%)
Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%)
Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%)
Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%)
Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%)
Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)

Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%)
Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%)
Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%)
Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%)
Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%)
Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%)
Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%)
Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%)
Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%)

While there's no improvement here, I consider reduced fragmentation events
to be worth on its own. Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:

Patch 1:

Compaction stalls 4153 3959 3978
Compaction success 1523 1441 1446
Compaction failures 2630 2517 2531
Page migrate success 4600827 4943120 5104348
Page migrate failure 19763 16656 17806
Compaction pages isolated 9597640 10305617 10653541
Compaction migrate scanned 77828948 86533283 87137064
Compaction free scanned 517758295 521312840 521462251
Compaction cost 5503 5932 6110

Patch 2:

Compaction stalls 3800 3450 3518
Compaction success 1421 1316 1317
Compaction failures 2379 2134 2201
Page migrate success 4160421 4502708 4752148
Page migrate failure 19705 14340 14911
Compaction pages isolated 8731983 9382374 9910043
Compaction migrate scanned 98362797 96349194 98609686
Compaction free scanned 496512560 469502017 480442545
Compaction cost 5173 5526 5811

As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.

Configuring the benchmark to allocate like THP page fault (i.e. no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot. This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.

Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-thp-1 5-thp-2 5-thp-3
Page alloc extfrag event 8148965 6227815 6646741
Extfrag fragmenting 8147872 6227130 6646117
Extfrag fragmenting for unmovable 10324 12942 15975
Extfrag fragmenting unmovable placed with movable 5972 8495 10907
Extfrag fragmenting for reclaimable 601 1707 2210
Extfrag fragmenting reclaimable placed with movable 520 1570 2000
Extfrag fragmenting for movable 8136947 6212481 6627932

Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-thp-1 6-thp-2 6-thp-3
Page alloc extfrag event 8345457 7574471 7020419
Extfrag fragmenting 8343546 7573777 7019718
Extfrag fragmenting for unmovable 10256 18535 30716
Extfrag fragmenting unmovable placed with movable 6893 11726 22181
Extfrag fragmenting for reclaimable 465 1208 1023
Extfrag fragmenting reclaimable placed with movable 353 996 843
Extfrag fragmenting for movable 8332825 7554034 6987979

Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-thp-1 7-thp-2 7-thp-3
Page alloc extfrag event 3512847 3020756 2891625
Extfrag fragmenting 3511940 3020185 2891059
Extfrag fragmenting for unmovable 9017 6892 6191
Extfrag fragmenting unmovable placed with movable 1524 3053 2435
Extfrag fragmenting for reclaimable 445 1081 1160
Extfrag fragmenting reclaimable placed with movable 375 918 986
Extfrag fragmenting for movable 3502478 3012212 2883708

Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-thp-1 8-thp-2 8-thp-3
Page alloc extfrag event 3181699 3082881 2674164
Extfrag fragmenting 3180812 3082303 2673611
Extfrag fragmenting for unmovable 1201 4031 4040
Extfrag fragmenting unmovable placed with movable 974 3611 3645
Extfrag fragmenting for reclaimable 478 1165 1294
Extfrag fragmenting reclaimable placed with movable 387 985 1030
Extfrag fragmenting for movable 3179133 3077107 2668277

The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1. Anyway, patch 2 rectifies it.

Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.

[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2

This patch (of 3):

When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype. With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:

1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)

These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g. distribute UNMOVABLE allocations among multiple
pageblocks.

Originally, decision for 1) has implied the decision for 3). Commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.

Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1). This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.

This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.

Note that evaluation has shown that the behavior introduced by
47118af076f6 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again. For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a7821.

[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a7821]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c32b3cbe 11-Feb-2015 Michal Hocko <mhocko@suse.cz>

oom, PM: make OOM detection in the freezer path raceless

Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.

The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.

out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.

Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because

- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d29e18a 11-Feb-2015 Juergen Gross <jgross@suse.com>

mm: use correct format specifiers when printing address ranges

Especially on 32 bit kernels memory node ranges are printed with 32 bit
wide addresses only. Use u64 types and %llx specifiers to print full
width of addresses.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81422f29 11-Feb-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: more checks on free_pages_prepare() for tail pages

Although it was not called, destroy_compound_page() did some potentially
useful checks. Let's re-introduce them in free_pages_prepare(), where
they can be actually triggered when CONFIG_DEBUG_VM=y.

compound_order() assert is already in free_pages_prepare(). We have few
checks for tail pages left.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e9f0d58 11-Feb-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm/page_alloc.c: drop dead destroy_compound_page()

The only caller is __free_one_page(). By the time we should have
page->flags to be cleared already:

- for 0-order pages though PCP list:
free_hot_cold_page()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
<put the page to PCP list>

free_pcppages_bulk()
page = <withdraw pages from PCP list>
__free_one_page(page)

- for non-0-order pages:
__free_pages_ok()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
free_one_page()
__free_one_page()

So there's no way PageCompound() will return true in __free_one_page().
Let's remove dead destroy_compound_page() and put assert for page->flags
there instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a6d53a1 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm: reduce try_to_compact_pages parameters

Expand the usage of the struct alloc_context introduced in the previous
patch also for calling try_to_compact_pages(), to reduce the number of its
parameters. Since the function is in different compilation unit, we need
to move alloc_context definition in the shared mm/internal.h header.

With this change we get simpler code and small savings of code size and stack
usage:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function old new delta
__alloc_pages_direct_compact 283 256 -27
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
function old new delta
try_to_compact_pages 582 569 -13

Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
scripts/checkstack.pl).

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9263751 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm, page_alloc: reduce number of alloc_pages* functions' parameters

Introduce struct alloc_context to accumulate the numerous parameters
passed between the alloc_pages* family of functions and
get_page_from_freelist(). This excludes gfp_flags and alloc_info, which
mutate too much along the way, and allocation order, which is conceptually
different.

The result is shorter function signatures, as well as overal code size and
stack usage reductions.

bloat-o-meter:

add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
function old new delta
get_page_from_freelist 2525 2652 +127
__alloc_pages_direct_compact 329 283 -46
__alloc_pages_nodemask 2564 2300 -264

checkstack.pl:

function old new
__alloc_pages_nodemask 248 200
get_page_from_freelist 168 184
__alloc_pages_direct_compact 40 24

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75379191 11-Feb-2015 Vlastimil Babka <vbabka@suse.cz>

mm: set page->pfmemalloc in prep_new_page()

The possibility of replacing the numerous parameters of alloc_pages*
functions with a single structure has been discussed when Minchan proposed
to expand the x86 kernel stack [1]. This series implements the change,
along with few more cleanups/microoptimizations.

The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
openSUSE 13.2 for compiling. Config includess NUMA and COMPACTION.

The core change is the introduction of a new struct alloc_context, which looks
like this:

struct alloc_context {
struct zonelist *zonelist;
nodemask_t *nodemask;
struct zone *preferred_zone;
int classzone_idx;
int migratetype;
enum zone_type high_zoneidx;
};

All the contents is mostly constant, except that __alloc_pages_slowpath()
changes preferred_zone, classzone_idx and potentially zonelist. But
that's not a problem in case control returns to retry_cpuset: in
__alloc_pages_nodemask(), those will be reset to initial values again
(although it's a bit subtle). On the other hand, gfp_flags and alloc_info
mutate so much that it doesn't make sense to put them into alloc_context.
Still, the result is one parameter instead of up to 7. This is all in
Patch 2.

Patch 3 is a step to expand alloc_context usage out of page_alloc.c
itself. The function try_to_compact_pages() can also much benefit from
the parameter reduction, but it means the struct definition has to be
moved to a shared header.

Patch 1 should IMHO be included even if the rest is deemed not useful
enough. It improves maintainability and also has some code/stack
reduction. Patch 4 is OTOH a tiny optimization.

Overall bloat-o-meter results:

add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
function old new delta
nr_free_zone_pages 129 115 -14
__alloc_pages_direct_compact 329 256 -73
get_page_from_freelist 2670 2576 -94
__alloc_pages_nodemask 2564 2285 -279
try_to_compact_pages 582 579 -3

Overall stack sizes per ./scripts/checkstack.pl:

old new delta
get_page_from_freelist: 184 184 0
__alloc_pages_nodemask 248 200 -48
__alloc_pages_direct_c 40 - -40
try_to_compact_pages 72 72 0
-88

[1] http://marc.info/?l=linux-mm&m=140142462528257&w=2

This patch (of 4):

prep_new_page() sets almost everything in the struct page of the page
being allocated, except page->pfmemalloc. This is not obvious and has at
least once led to a bug where page->pfmemalloc was forgotten to be set
correctly, see commit 8fb74b9fb2b1 ("mm: compaction: partially revert
capture of suitable high-order page").

This patch moves the pfmemalloc setting to prep_new_page(), which means it
needs to gain alloc_flags parameter. The call to prep_new_page is moved
from buffered_rmqueue() to get_page_from_freelist(), which also leads to
simpler code. An obsolete comment for buffered_rmqueue() is replaced.

In addition to better maintainability there is a small reduction of code
and stack usage for get_page_from_freelist(), which inlines the other
functions involved.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
function old new delta
get_page_from_freelist 2670 2525 -145

Stack usage is reduced from 184 to 168 bytes.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23f086f9 11-Feb-2015 Xishi Qiu <qiuxishi@huawei.com>

kmemcheck: move hook into __alloc_pages_nodemask() for the page allocator

Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
__alloc_pages_nodemask()
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()

And the page will not be tracked by kmemcheck in the following path.
__alloc_pages_nodemask()
get_page_from_freelist()

So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
like this:
__alloc_pages_nodemask()
...
get_page_from_freelist()
if (!page)
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
...

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91fbdc0f 11-Feb-2015 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c:__alloc_pages_nodemask(): don't alter arg gfp_mask

__alloc_pages_nodemask() strips __GFP_IO when retrying the page
allocation. But it does this by altering the function-wide variable
gfp_mask. This will cause subsequent allocation attempts to inadvertently
use the modified gfp_mask.

Also, pass the correct mask (the mask we actually used) into
trace_mm_page_alloc().

Cc: Ming Lei <ming.lei@canonical.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4c5018ce 10-Feb-2015 Weijie Yang <weijie.yang@samsung.com>

mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check

If the freeing page and its buddy page are not at the same zone, the
current holding zone->lock for the freeing page cann't prevent buddy page
getting allocated, this could trigger VM_BUG_ON_PAGE in page_is_buddy() at
a very tiny chance, such as:

cpu 0: cpu 1:
hold zone_1 lock
check page and it buddy
PageBuddy(buddy) is true hold zone_2 lock
page_order(buddy) == order is true alloc buddy
trigger VM_BUG_ON_PAGE(page_count(buddy) != 0)

zone_1->lock prevents the freeing page getting allocated
zone_2->lock prevents the buddy page getting allocated
they are not the same zone->lock.

If we can't remove the zone_id check statement, it's better handle this
rare race. This patch fixes this by placing the zone_id check before the
VM_BUG_ON_PAGE check.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9879de73 26-Jan-2015 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: embed OOM killing naturally into allocation slowpath

The OOM killing invocation does a lot of duplicative checks against the
task's allocation context. Rework it to take advantage of the existing
checks in the allocator slowpath.

The OOM killer is invoked when the allocator is unable to reclaim any
pages but the allocation has to keep looping. Instead of having a check
for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM
invocation to the true branch of should_alloc_retry(). The __GFP_FS
check from oom_gfp_allowed() can then be moved into the OOM avoidance
branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test.

__alloc_pages_may_oom() can then signal to the caller whether the OOM
killer was invoked, instead of requiring it to duplicate the order and
high_zoneidx checks to guess this when deciding whether to continue.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e48322ab 18-Dec-2014 Pintu Kumar <pintu.k@samsung.com>

mm: cma: split cma-reserved in dmesg log

When the system boots up, in the dmesg logs we can see the memory
statistics along with total reserved as below. Memory: 458840k/458840k
available, 65448k reserved, 0K highmem

When CMA is enabled, still the total reserved memory remains the same.
However, the CMA memory is not considered as reserved. But, when we see
/proc/meminfo, the CMA memory is part of free memory. This creates
confusion. This patch corrects the problem by properly subtracting the
CMA reserved memory from the total reserved memory in dmesg logs.

Below is the dmesg snapshot from an arm based device with 512MB RAM and
12MB single CMA region.

Before this change:
Memory: 458840k/458840k available, 65448k reserved, 0K highmem

After this change:
Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba914f48 12-Dec-2014 Zhong Hongbo <bocui107@gmail.com>

mm: remove the highmem zones' memmap in the highmem zone

Since 01cefaef40c4 ("mm: provide more accurate estimation
of pages occupied by memmap") allocate the pages from lowmem for the
highmem zones' memmap. So It is not need to reserver the memmap's for
the highmem.

A 2G DDR3 for the arm platform:
On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 528 pages used for memmap
HighMem zone: 67584 pages, LIFO batch:15

On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 67584 pages, LIFO batch:15

Signed-off-by: Hongbo Zhong <hongbo.zhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b4f7799 12-Dec-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: vmscan: invoke slab shrinkers from shrink_zone()

The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.

For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.

This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.

Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.

[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48c96a36 12-Dec-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_owner: keep track of page owners

This is the page owner tracking code which is introduced so far ago. It
is resident on Andrew's tree, though, nobody tried to upstream so it
remain as is. Our company uses this feature actively to debug memory leak
or to find a memory hogger so I decide to upstream this feature.

This functionality help us to know who allocates the page. When
allocating a page, we store some information about allocation in extra
memory. Later, if we need to know status of all pages, we can get and
analyze it from this stored information.

In previous version of this feature, extra memory is statically defined in
struct page, but, in this version, extra memory is allocated outside of
struct page. It enables us to turn on/off this feature at boottime
without considerable memory waste.

Although we already have tracepoint for tracing page allocation/free,
using it to analyze page owner is rather complex. We need to enlarge the
trace buffer for preventing overlapping until userspace program launched.
And, launched program continually dump out the trace buffer for later
analysis and it would change system behaviour with more possibility rather
than just keeping it in memory, so bad for debug.

Moreover, we can use page_owner feature further for various purposes. For
example, we can use it for fragmentation statistics implemented in this
patch. And, I also plan to implement some CMA failure debugging feature
using this interface.

I'd like to give the credit for all developers contributed this feature,
but, it's not easy because I don't know exact history. Sorry about that.
Below is people who has "Signed-off-by" in the patches in Andrew's tree.

Contributor:
Alexander Nyberg <alexn@dsv.su.se>
Mel Gorman <mgorman@suse.de>
Dave Hansen <dave@linux.vnet.ibm.com>
Minchan Kim <minchan@kernel.org>
Michal Nazarewicz <mina86@mina86.com>
Andrew Morton <akpm@linux-foundation.org>
Jungsoo Son <jungsoo.son@lge.com>

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 031bc574 12-Dec-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug-pagealloc: make debug-pagealloc boottime configurable

Now, we have prepared to avoid using debug-pagealloc in boottime. So
introduce new kernel-parameter to disable debug-pagealloc in boottime, and
makes related functions to be disabled in this case.

Only non-intuitive part is change of guard page functions. Because guard
page is effective only if debug-pagealloc is enabled, turning off
according to debug-pagealloc is reasonable thing to do.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e30825f1 12-Dec-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug-pagealloc: prepare boottime configurable on/off

Until now, debug-pagealloc needs extra flags in struct page, so we need to
recompile whole source code when we decide to use it. This is really
painful, because it takes some time to recompile and sometimes rebuild is
not possible due to third party module depending on struct page. So, we
can't use this good feature in many cases.

Now, we have the page extension feature that allows us to insert extra
flags to outside of struct page. This gets rid of third party module
issue mentioned above. And, this allows us to determine if we need extra
memory for this page extension in boottime. With these property, we can
avoid using debug-pagealloc in boottime with low computational overhead in
the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our
development process greatly.

This patch is the preparation step to achive above goal. debug-pagealloc
originally uses extra field of struct page, but, after this patch, it will
use field of struct page_ext. Because memory for page_ext is allocated
later than initialization of page allocator in CONFIG_SPARSEMEM, we should
disable debug-pagealloc feature temporarily until initialization of
page_ext. This patch implements this.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eefa864b 12-Dec-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_ext: resurrect struct page extending code for debugging

When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.

This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.

Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.

To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.

Others are completely same with previous extension code in memcg.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2847cf95 12-Dec-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug-pagealloc: cleanup page guard code

Page guard is used by debug-pagealloc feature. Currently, it is
open-coded, but, I think that more abstraction of it makes core page
allocator code more readable.

There is no functional difference.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9edad6ea 10-Dec-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: move page->mem_cgroup bad page handling into generic code

Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1306a85a 10-Dec-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: embed the memcg pointer directly into struct page

Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

text data bss dec hex filename
8828345 1725264 983040 11536649 b00909 vmlinux.old
8827425 1725264 966656 11519345 afc571 vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 26086de3 10-Dec-2014 Wei Yuan <weiyuan.wei@huawei.com>

mm: fix a spelling mistake

Signed-off-by Wei Yuan <weiyuan.wei@huawei.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fdaf7f5c 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: more focused lru and pcplists draining

The goal of memory compaction is to create high-order freepages through
page migration. Page migration however puts pages on the per-cpu lru_add
cache, which is later flushed to per-cpu pcplists, and only after pcplists
are drained the pages can actually merge. This can happen due to the
per-cpu caches becoming full through further freeing, or explicitly.

During direct compaction, it is useful to do the draining explicitly so
that pages merge as soon as possible and compaction can detect success
immediately and keep the latency impact at minimum. However the current
implementation is far from ideal. Draining is done only in
__alloc_pages_direct_compact(), after all zones were already compacted,
and the decisions to continue or stop compaction in individual zones was
done without the last batch of migrations being merged. It is also
missing the draining of lru_add cache before the pcplists.

This patch moves the draining for direct compaction into compact_zone().
It adds the missing lru_cache draining and uses the newly introduced
single zone pcplists draining to reduce overhead and avoid impact on
unrelated zones. Draining is only performed when it can actually lead to
merging of a page of desired order (passed by cc->order). This means it
is only done when migration occurred in the previously scanned cc->order
aligned block(s) and the migration scanner is now pointing to the next
cc->order aligned block.

The patch has been tested with stress-highalloc benchmark from mmtests.
Although overal allocation success rates of the benchmark were not
affected, the number of detected compaction successes has doubled. This
suggests that allocations were previously successful due to implicit
merging caused by background activity, making a later allocation attempt
succeed immediately, but not attributing the success to compaction. Since
stress-highalloc always tries to allocate almost the whole memory, it
cannot show the improvement in its reported success rate metric. However
after this patch, compaction should detect success and terminate earlier,
reducing the direct compaction latencies in a real scenario.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97d47a65 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: simplify deferred compaction

Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually
instead of preferred zone"), compaction is deferred for each zone where
sync direct compaction fails, and reset where it succeeds. However, it
was observed that for DMA zone compaction often appeared to succeed
while subsequent allocation attempt would not, due to different outcome
of watermark check.

In order to properly defer compaction in this zone, the candidate zone
has to be passed back to __alloc_pages_direct_compact() and compaction
deferred in the zone after the allocation attempt fails.

The large source of mismatch between watermark check in compaction and
allocation was the lack of alloc_flags and classzone_idx values in
compaction, which has been fixed in the previous patch. So with this
problem fixed, we can simplify the code by removing the candidate_zone
parameter and deferring in __alloc_pages_direct_compact().

After this patch, the compaction activity during stress-highalloc
benchmark is still somewhat increased, but it's negligible compared to the
increase that occurred without the better watermark checking. This
suggests that it is still possible to apparently succeed in compaction but
fail to allocate, possibly due to parallel allocation activity.

[akpm@linux-foundation.org: fix build]
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebff3980 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: pass classzone_idx and alloc_flags to watermark checking

Compaction relies on zone watermark checks for decisions such as if it's
worth to start compacting in compaction_suitable() or whether compaction
should stop in compact_finished(). The watermark checks take
classzone_idx and alloc_flags parameters, which are related to the memory
allocation request. But from the context of compaction they are currently
passed as 0, including the direct compaction which is invoked to satisfy
the allocation request, and could therefore know the proper values.

The lack of proper values can lead to mismatch between decisions taken
during compaction and decisions related to the allocation request. Lack
of proper classzone_idx value means that lowmem_reserve is not taken into
account. This has manifested (during recent changes to deferred
compaction) when DMA zone was used as fallback for preferred Normal zone.
compaction_suitable() without proper classzone_idx would think that the
watermarks are already satisfied, but watermark check in
get_page_from_freelist() would fail. Because of this problem, deferring
compaction has extra complexity that can be removed in the following
patch.

The issue (not confirmed in practice) with missing alloc_flags is opposite
in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
CMA-enabled systems) the watermark checking in compaction with 0 passed
will be stricter than in get_page_from_freelist(). In these cases
compaction might be running for a longer time than is really needed.

Another issue compaction_suitable() is that the check for "does the zone
need compaction at all?" comes only after the check "does the zone have
enough free free pages to succeed compaction". The latter considers extra
pages for migration and can therefore in some situations fail and return
COMPACT_SKIPPED, although the high-order allocation would succeed and we
should return COMPACT_PARTIAL.

This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction
and watermark checking. Where possible, all other callers of
compaction_suitable() pass proper values where those are known. This is
currently limited to classzone_idx, which is sometimes known in kswapd
context. However, the direct reclaim callers should_continue_reclaim()
and compaction_ready() do not currently know the proper values, so the
coordination between reclaim and compaction may still not be as accurate
as it could. This can be fixed later, if it's shown to be an issue.

Additionaly the checks in compact_suitable() are reordered to address the
second issue described above.

The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA. It allows simplifying deferred
compaction code in a followup patch.

When testing with stress-highalloc, there was some slight improvement
(which might be just due to variance) in success rates of non-THP-like
allocations.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab1f306f 10-Dec-2014 Yu Zhao <yuzhao@google.com>

mm: verify compound order when freeing a page

This allows us to catch the bug fixed in the previous patch (mm: free
compound page with correct order).

Here we also verify whether a page is tail page or not -- tail pages are
supposed to be freed along with their head, not by themselves.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 510f5507 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm, cma: drain single zone pcplists

CMA allocation drains pcplists so that pages can merge back to buddy
allocator. Since it operates on a single zone, we can reduce the
pcplists drain to the single zone, which is now possible.

The change should make CMA allocations faster and not disturbing
unrelated pcplists anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93481ff0 10-Dec-2014 Vlastimil Babka <vbabka@suse.cz>

mm: introduce single zone pcplists drain

The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.

This patch introduces new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to the single
zone.

All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f88dfff5 10-Dec-2014 Anton Blanchard <anton@samba.org>

mm/page_alloc.c: convert boot printks without log level to pr_info

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57cbc87e 13-Nov-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/debug-pagealloc: correct freepage accounting and order resetting

One thing I did in this patch is fixing freepage accounting. If we
clear guard page and link it onto isolate buddy list, we should not
increase freepage count. This patch adds conditional branch to skip
counting in this case. Without this patch, this overcounting happens
frequently if guard order is set and CMA is used.

Another thing fixed in this patch is the target to reset order. In
__free_one_page(), we check the buddy page whether it is a guard page or
not. And, if so, we should clear guard attribute on the buddy page and
reset order of it to 0. But, current code resets original page's order
rather than buddy one's. Maybe, this doesn't have any problem, because
whole merged page's order will be re-assigned soon. But, it is better
to correct code.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dae803e1 13-Nov-2014 Michal Nazarewicz <mina86@mina86.com>

mm: alloc_contig_range: demote pages busy message from warn to info

Having test_pages_isolated failure message as a warning confuses users
into thinking that it is more serious than it really is. In reality, if
called via CMA, allocation will be retried so a single
test_pages_isolated failure does not prevent allocation from succeeding.

Demote the warning message to an info message and reformat it such that
the text "failed" does not appear and instead a less worrying "PFNS
busy" is used.

This message is trivially reproducible on a 10GB x86 machine on 3.16.y
kernels configured with CONFIG_DMA_CMA.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c605096 13-Nov-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: restrict max order of merging on isolated pageblock

Current pageblock isolation logic could isolate each pageblock
individually. This causes freepage accounting problem if freepage with
pageblock order on isolate pageblock is merged with other freepage on
normal pageblock. We can prevent merging by restricting max order of
merging to pageblock order if freepage is on isolate pageblock.

A side-effect of this change is that there could be non-merged buddy
freepage even if finishing pageblock isolation, because undoing
pageblock isolation is just to move freepage from isolate buddy list to
normal buddy list rather than to consider merging. So, the patch also
makes undoing pageblock isolation consider freepage merge. When
un-isolation, freepage with more than pageblock order and it's buddy are
checked. If they are on normal pageblock, instead of just moving, we
isolate the freepage and free it in order to get merged.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f82b55d 13-Nov-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: move freepage counting logic to __free_one_page()

All the caller of __free_one_page() has similar freepage counting logic,
so we can move it to __free_one_page(). This reduce line of code and
help future maintenance.

This is also preparation step for "mm/page_alloc: restrict max order of
merging on isolated pageblock" which fix the freepage counting problem
on freepage with more than pageblock order.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51bb1a40 13-Nov-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: add freepage on isolate pageblock to correct buddy list

In free_pcppages_bulk(), we use cached migratetype of freepage to
determine type of buddy list where freepage will be added. This
information is stored when freepage is added to pcp list, so if
isolation of pageblock of this freepage begins after storing, this
cached information could be stale. In other words, it has original
migratetype rather than MIGRATE_ISOLATE.

There are two problems caused by this stale information.

One is that we can't keep these freepages from being allocated.
Although this pageblock is isolated, freepage will be added to normal
buddy list so that it could be allocated without any restriction. And
the other problem is incorrect freepage accounting. Freepages on
isolate pageblock should not be counted for number of freepage.

Following is the code snippet in free_pcppages_bulk().

/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
if (likely(!is_migrate_isolate_page(page))) {
__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
}

As you can see above snippet, current code already handle second
problem, incorrect freepage accounting, by re-fetching pageblock
migratetype through is_migrate_isolate_page(page).

But, because this re-fetched information isn't used for
__free_one_page(), first problem would not be solved. This patch try to
solve this situation to re-fetch pageblock migratetype before
__free_one_page() and to use it for __free_one_page().

In addition to move up position of this re-fetch, this patch use
optimization technique, re-fetching migratetype only if there is isolate
pageblock. Pageblock isolation is rare event, so we can avoid
re-fetching in common case with this optimization.

This patch also correct migratetype of the tracepoint output.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad53f92e 13-Nov-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype

Before describing bugs itself, I first explain definition of freepage.

1. pages on buddy list are counted as freepage.
2. pages on isolate migratetype buddy list are *not* counted as freepage.
3. pages on cma buddy list are counted as CMA freepage, too.

Now, I describe problems and related patch.

Patch 1: There is race conditions on getting pageblock migratetype that
it results in misplacement of freepages on buddy list, incorrect
freepage count and un-availability of freepage.

Patch 2: Freepages on pcp list could have stale cached information to
determine migratetype of buddy list to go. This causes misplacement of
freepages on buddy list and incorrect freepage count.

Patch 4: Merging between freepages on different migratetype of
pageblocks will cause freepages accouting problem. This patch fixes it.

Without patchset [3], above problem doesn't happens on my CMA allocation
test, because CMA reserved pages aren't used at all. So there is no
chance for above race.

With patchset [3], I did simple CMA allocation test and get below
result:

- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) on background
- 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
- Result: more than 5000 freepage count are missed

With patchset [3] and this patchset, I found that no freepage count are
missed so that I conclude that problems are solved.

On my simple memory offlining test, these problems also occur on that
environment, too.

This patch (of 4):

There are two paths to reach core free function of buddy allocator,
__free_one_page(), one is free_one_page()->__free_one_page() and the
other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
Each paths has race condition causing serious problems. At first, this
patch is focused on first type of freepath. And then, following patch
will solve the problem in second type of freepath.

In the first type of freepath, we got migratetype of freeing page
without holding the zone lock, so it could be racy. There are two cases
of this race.

1. pages are added to isolate buddy list after restoring orignal
migratetype

CPU1 CPU2

get migratetype => return MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE

grab the zone lock
unisolate pageblock
release the zone lock

grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage go into isolate buddy list,
although pageblock is already unisolated

This may cause two problems. One is that we can't use this page anymore
until next isolation attempt of this pageblock, because freepage is on
isolate buddy list. The other is that freepage accouting could be wrong
due to merging between different buddy list. Freepages on isolate buddy
list aren't counted as freepage, but ones on normal buddy list are
counted as freepage. If merge happens, buddy freepage on normal buddy
list is inevitably moved to isolate buddy list without any consideration
of freepage accouting so it could be incorrect.

2. pages are added to normal buddy list while pageblock is isolated.
It is similar with above case.

This also may cause two problems. One is that we can't keep these
freepages from being allocated. Although this pageblock is isolated,
freepage would be added to normal buddy list so that it could be
allocated without any restriction. And the other problem is same as
case 1, that it, incorrect freepage accouting.

This race condition would be prevented by checking migratetype again
with holding the zone lock. Because it is somewhat heavy operation and
it isn't needed in common case, we want to avoid rechecking as much as
possible. So this patch introduce new variable, nr_isolate_pageblock in
struct zone to check if there is isolated pageblock. With this, we can
avoid to re-check migratetype in common case and do it only if there is
isolated pageblock or migratetype is MIGRATE_ISOLATE. This solve above
mentioned problems.

Changes from v3:
Add one more check in free_one_page() that checks whether migratetype is
MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 344736f2 20-Oct-2014 Vladimir Davydov <vdavydov.dev@gmail.com>

cpuset: simplify cpuset_node_allowed API

Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.

Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
rework cpuset_zone_allowed api"). Before, we had the only version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.

However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.

So let's simplify the API back to the single check.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 5695be14 20-Oct-2014 Michal Hocko <mhocko@suse.cz>

OOM, PM: OOM killed task shouldn't escape PM suspend

PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.

Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.

Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.

Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 82742a3a 09-Oct-2014 Sasha Levin <sasha.levin@oracle.com>

mm: move debug code out of page_alloc.c

dump_page() and dump_vma() are not specific to page_alloc.c, move them out
so page_alloc.c won't turn into the unofficial debug repository.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3193913c 09-Oct-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit

Zones are allocated by the page allocator in either node or zone order.
Node ordering is preferred in terms of locality and is applied
automatically in one of three cases:

1. If a node has only low memory

2. If DMA/DMA32 is a high percentage of memory

3. If low memory on a single node is greater than 70% of the node size

Otherwise zone ordering is used to preserve low memory for devices that
require it. Unfortunately a consequence of this is that applications
running on a machine with balanced NUMA nodes will experience different
performance characteristics depending on which node they happen to start
from.

The point of zone ordering is to protect lower zones for devices that
require DMA/DMA32 memory. When NUMA was first introduced, this was
critical as 32-bit NUMA machines existed and exhausting low memory
triggered OOMs easily as so many allocations required low memory. On
64-bit machines the primary concern is devices that are 32-bit only which
is less severe than the low memory exhaustion problem on 32-bit NUMA. It
seems there are really few devices that depends on it.

AGP -- I assume this is getting more rare but even then I think the allocations
happen early in boot time where lowmem pressure is less of a problem

DRM -- If the device is 32-bit only then there may be low pressure. I didn't
evaluate these in detail but it looks like some of these are mobile
graphics card. Not many NUMA laptops out there. DRM folk should know
better though.

Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?

B43 wireless card -- again not really a NUMA thing.

I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
machines in case someone throws a brain damanged TV or graphics card in there.
This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
to make it default everywhere but I understand that some embedded arches may
be using 32-bit NUMA where I cannot predict the consequences.

The performance impact depends on the workload and the characteristics of the
machine and the machine I tested on had a large Normal zone on node 0 so the
impact is within the noise for the majority of tests. The allocation stats
show more allocation requests were from DMA32 and local node. Running SpecJBB
with multiple JVMs and automatic NUMA balancing disabled the results were

specjbb
3.17.0-rc2 3.17.0-rc2
vanilla nodeorder-v1r1
Min 1 29534.00 ( 0.00%) 30020.00 ( 1.65%)
Min 10 115717.00 ( 0.00%) 134038.00 ( 15.83%)
Min 19 109718.00 ( 0.00%) 114186.00 ( 4.07%)
Min 28 104459.00 ( 0.00%) 103639.00 ( -0.78%)
Min 37 98245.00 ( 0.00%) 103756.00 ( 5.61%)
Min 46 97198.00 ( 0.00%) 96197.00 ( -1.03%)
Mean 1 30953.25 ( 0.00%) 31917.75 ( 3.12%)
Mean 10 124432.50 ( 0.00%) 140904.00 ( 13.24%)
Mean 19 116033.50 ( 0.00%) 119294.75 ( 2.81%)
Mean 28 108365.25 ( 0.00%) 106879.50 ( -1.37%)
Mean 37 102984.75 ( 0.00%) 106924.25 ( 3.83%)
Mean 46 100783.25 ( 0.00%) 105368.50 ( 4.55%)
Stddev 1 1260.38 ( 0.00%) 1109.66 ( 11.96%)
Stddev 10 7434.03 ( 0.00%) 5171.91 ( 30.43%)
Stddev 19 8453.84 ( 0.00%) 5309.59 ( 37.19%)
Stddev 28 4184.55 ( 0.00%) 2906.63 ( 30.54%)
Stddev 37 5409.49 ( 0.00%) 3192.12 ( 40.99%)
Stddev 46 4521.95 ( 0.00%) 7392.52 (-63.48%)
Max 1 32738.00 ( 0.00%) 32719.00 ( -0.06%)
Max 10 136039.00 ( 0.00%) 148614.00 ( 9.24%)
Max 19 130566.00 ( 0.00%) 127418.00 ( -2.41%)
Max 28 115404.00 ( 0.00%) 111254.00 ( -3.60%)
Max 37 112118.00 ( 0.00%) 111732.00 ( -0.34%)
Max 46 108541.00 ( 0.00%) 116849.00 ( 7.65%)
TPut 1 123813.00 ( 0.00%) 127671.00 ( 3.12%)
TPut 10 497730.00 ( 0.00%) 563616.00 ( 13.24%)
TPut 19 464134.00 ( 0.00%) 477179.00 ( 2.81%)
TPut 28 433461.00 ( 0.00%) 427518.00 ( -1.37%)
TPut 37 411939.00 ( 0.00%) 427697.00 ( 3.83%)
TPut 46 403133.00 ( 0.00%) 421474.00 ( 4.55%)

3.17.0-rc2 3.17.0-rc2
vanillanodeorder-v1r1
DMA allocs 0 0
DMA32 allocs 57 1491992
Normal allocs 32543566 30026383
Movable allocs 0 0
Direct pages scanned 0 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 0 0
Kswapd efficiency 100% 100%
Kswapd velocity 0.000 0.000
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Percentage direct scans 0% 0%
Zone normal velocity 0.000 0.000
Zone dma32 velocity 0.000 0.000
Zone dma velocity 0.000 0.000
THP fault alloc 55164 52987
THP collapse alloc 139 147
THP splits 26 21
NUMA alloc hit 4169066 4250692
NUMA alloc miss 0 0

Note that there were more DMA32 allocations with the patch applied. In this
particular case there was no difference in numa_hit and numa_miss. The
expectation is that DMA32 was being used at the low watermark instead of
falling into the slow path. kswapd was not woken but it's not worken for
THP allocations.

On 32-bit, this patch defaults to zone-ordering as low memory depletion
can be a serious problem on 32-bit large memory machines. If the default
ordering was node then processes on node 0 will deplete the Normal zone
due to normal activity. The problem is worse if CONFIG_HIGHPTE is not
set. If combined with large amounts of dirty/writeback pages in Normal
zone then there is also a high risk of OOM. The heuristics are removed
as it's not clear they were ever important on 32-bit. They were only
relevant for setting node-ordering on 64-bit.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97ee4ba7 09-Oct-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ON

Since 2.6.24 there has been a paranoid check in move_freepages that looks
up the zone of two pages. This is a very slow path and the only time I've
seen this bug trigger recently is when memory initialisation was broken
during patch development. Despite the fact it's a slow path, this patch
converts the check to a VM_BUG_ON anyway as it has served its purpose by
now.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57054651 09-Oct-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: clean up zone flags

Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
reader through layers indirection just to track down a simple bit.

Remove all zone flag wrappers and just use bitops against zone->flags
directly. It's just as readable and the lines are barely any longer.

Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
remove the zone_flags_t typedef.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ade3c99 09-Oct-2014 Weijie Yang <weijie.yang@samsung.com>

mm: page_alloc: avoid wakeup kswapd on the unintended node

When entering the page_alloc slowpath, we wakeup kswapd on every pgdat
according to the zonelist and high_zoneidx. However, this doesn't take
nodemask into account, and could prematurely wakeup kswapd on some
unintended nodes.

This patch uses for_each_zone_zonelist_nodemask() instead of
for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
situation.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0bf55139 09-Oct-2014 Sasha Levin <sasha.levin@oracle.com>

mm: introduce dump_vma

Introduce a helper to dump information about a VMA, this also makes
dump_page_flags more generic and re-uses that so the output looks very
similar to dump_page:

[ 61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
[ 61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
[ 61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops (null)
[ 61.903437] pgoff 7ffffffdd file (null) private_data (null)
[ 61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)

[akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
[swarren@nvidia.com: fix dump_vma() compilation]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43e7a34d 09-Oct-2014 David Rientjes <rientjes@google.com>

mm: rename allocflags_to_migratetype for clarity

The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.

The function allocflags_to_migratetype() actually takes gfp flags, not
alloc flags, and returns a migratetype. Rename it to
gfpflags_to_migratetype().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f9efdef 09-Oct-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: khugepaged should not give up due to need_resched()

Async compaction aborts when it detects zone lock contention or
need_resched() is true. David Rientjes has reported that in practice,
most direct async compactions for THP allocation abort due to
need_resched(). This means that a second direct compaction is never
attempted, which might be OK for a page fault, but khugepaged is intended
to attempt a sync compaction in such case and in these cases it won't.

This patch replaces "bool contended" in compact_control with an int that
distinguishes between aborting due to need_resched() and aborting due to
lock contention. This allows propagating the abort through all compaction
functions as before, but passing the abort reason up to
__alloc_pages_slowpath() which decides when to continue with direct
reclaim and another compaction attempt.

Another problem is that try_to_compact_pages() did not act upon the
reported contention (both need_resched() or lock contention) immediately
and would proceed with another zone from the zonelist. When
need_resched() is true, that means initializing another zone compaction,
only to check again need_resched() in isolate_migratepages() and aborting.
For zone lock contention, the unintended consequence is that the lock
contended status reported back to the allocator is detrmined from the last
zone where compaction was attempted, which is rather arbitrary.

This patch fixes the problem in the following way:
- async compaction of a zone aborting due to need_resched() or fatal signal
pending means that further zones should not be tried. We report
COMPACT_CONTENDED_SCHED to the allocator.
- aborting zone compaction due to lock contention means we can still try
another zone, since it has different set of locks. We report back
COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
it was aborted due to lock contention.

As a result of these fixes, khugepaged will proceed with second sync
compaction as intended, when the preceding async compaction aborted due to
need_resched(). Page fault compactions aborting due to need_resched()
will spare some cycles previously wasted by initializing another zone
compaction only to abort again. Lock contention will be reported only
when compaction in all zones aborted due to lock contention, and therefore
it's not a good idea to try again after reclaim.

In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
has improved number of THP collapse allocations by 10%, which shows
positive effect on khugepaged. The benchmark's success rates are
unchanged as it is not recognized as khugepaged. Numbers of compact_stall
and compact_fail events have however decreased by 20%, with
compact_success still a bit improved, which is good. With benchmark
configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
collapse allocations, and only slight improvement in stalls and failures.

[akpm@linux-foundation.org: fix warnings]
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# edc2ca61 09-Oct-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: move pageblock checks up from isolate_migratepages_range()

isolate_migratepages_range() is the main function of the compaction
scanner, called either on a single pageblock by isolate_migratepages()
during regular compaction, or on an arbitrary range by CMA's
__alloc_contig_migrate_range(). It currently perfoms two pageblock-wide
compaction suitability checks, and because of the CMA callpath, it tracks
if it crossed a pageblock boundary in order to repeat those checks.

However, closer inspection shows that those checks are always true for CMA:
- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
- migrate_async_suitable() check is skipped because CMA uses sync compaction

We can therefore move the compaction-specific checks to
isolate_migratepages() and simplify isolate_migratepages_range().
Furthermore, we can mimic the freepage scanner family of functions, which
has isolate_freepages_block() function called both by compaction from
isolate_freepages() and by CMA from isolate_freepages_range(), where each
use-case adds own specific glue code. This allows further code
simplification.

Thus, we rename isolate_migratepages_range() to
isolate_migratepages_block() and limit its functionality to a single
pageblock (or its subset). For CMA, a new different
isolate_migratepages_range() is created as a CMA-specific wrapper for the
_block() function. The checks specific to compaction are moved to
isolate_migratepages(). As part of the unification of these two families
of functions, we remove the redundant zone parameter where applicable,
since zone pointer is already passed in cc->zone.

Furthermore, going back to compact_zone() and compact_finished() when
pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
- the checks are meant to skip pageblocks quickly. The patch therefore
also introduces a simple loop into isolate_migratepages() so that it does
not return immediately on failed pageblock checks, but keeps going until
isolate_migratepages_range() gets called once. Similarily to
isolate_freepages(), the function periodically checks if it needs to
reschedule or abort async compaction.

[iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98dd3b48 09-Oct-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: do not count compact_stall if all zones skipped compaction

The compact_stall vmstat counter counts the number of allocations stalled
by direct compaction. It does not count when all attempted zones had
deferred compaction, but it does count when all zones skipped compaction.
The skipping is decided based on very early check of
compaction_suitable(), based on watermarks and memory fragmentation.
Therefore it makes sense not to count skipped compactions as stalls.
Moreover, compact_success or compact_fail is also already not being
counted when compaction was skipped, so this patch changes the
compact_stall counting to match the other two.

Additionally, restructure __alloc_pages_direct_compact() code for better
readability.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53853e2d 09-Oct-2014 Vlastimil Babka <vbabka@suse.cz>

mm, compaction: defer each zone individually instead of preferred zone

When direct sync compaction is often unsuccessful, it may become deferred
for some time to avoid further useless attempts, both sync and async.
Successful high-order allocations un-defer compaction, while further
unsuccessful compaction attempts prolong the compaction deferred period.

Currently the checking and setting deferred status is performed only on
the preferred zone of the allocation that invoked direct compaction. But
compaction itself is attempted on all eligible zones in the zonelist, so
the behavior is suboptimal and may lead both to scenarios where 1)
compaction is attempted uselessly, or 2) where it's not attempted despite
good chances of succeeding, as shown on the examples below:

1) A direct compaction with Normal preferred zone failed and set
deferred compaction for the Normal zone. Another unrelated direct
compaction with DMA32 as preferred zone will attempt to compact DMA32
zone even though the first compaction attempt also included DMA32 zone.

In another scenario, compaction with Normal preferred zone failed to
compact Normal zone, but succeeded in the DMA32 zone, so it will not
defer compaction. In the next attempt, it will try Normal zone which
will fail again, instead of skipping Normal zone and trying DMA32
directly.

2) Kswapd will balance DMA32 zone and reset defer status based on
watermarks looking good. A direct compaction with preferred Normal
zone will skip compaction of all zones including DMA32 because Normal
was still deferred. The allocation might have succeeded in DMA32, but
won't.

This patch makes compaction deferring work on individual zone basis
instead of preferred zone. For each zone, it checks compaction_deferred()
to decide if the zone should be skipped. If watermarks fail after
compacting the zone, defer_compaction() is called. The zone where
watermarks passed can still be deferred when the allocation attempt is
unsuccessful. When allocation is successful, compaction_defer_reset() is
called for the zone containing the allocated page. This approach should
approximate calling defer_compaction() only on zones where compaction was
attempted and did not yield allocated page. There might be corner cases
but that is inevitable as long as the decision to stop compacting dues not
guarantee that a page will be allocated.

Due to a new COMPACT_DEFERRED return value, some functions relying
implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
more accurate. The did_some_progress output parameter of
__alloc_pages_direct_compact() is removed completely, as the caller
actually does not use it after compaction sets it - it is only considered
when direct reclaim sets it.

During testing on a two-node machine with a single very small Normal zone
on node 1, this patch has improved success rates in stress-highalloc
mmtests benchmark. The success here were previously made worse by commit
3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
kswapd") as kswapd was no longer resetting often enough the deferred
compaction for the Normal zone, and DMA32 zones on both nodes were thus
not considered for compaction. On different machine, success rates were
improved with __GFP_NO_KSWAPD allocations.

[akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21bb9bd1 09-Oct-2014 Vlastimil Babka <vbabka@suse.cz>

mm: page_alloc: determine migratetype only once

The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
from gfp_mask in each retry pass, although the migratetype variable
already has the value determined and it does not change. Use the variable
and perform the check only once. Also convert #ifdef CONFIG_CMA to
IS_ENABLED.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad2c8144 09-Oct-2014 Joonsoo Kim <iamjoonsoo.kim@lge.com>

topology: add support for node_to_mem_node() to determine the fallback node

Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
on ppc LPARs with memoryless nodes, a large amount of memory was consumed
by slabs and was marked unreclaimable. He tracked it down to slab
deactivations in the SLUB core when we allocate remotely, leading to poor
efficiency always when memoryless nodes are present.

After much discussion, Joonsoo provided a few patches that help
significantly. They don't resolve the problem altogether:

- memory hotplug still needs testing, that is when a memoryless node
becomes memory-ful, we want to dtrt
- there are other reasons for going off-node than memoryless nodes,
e.g., fully exhausted local nodes

Neither case is resolved with this series, but I don't think that should
block their acceptance, as they can be explored/resolved with follow-on
patches.

The series consists of:

[1/3] topology: add support for node_to_mem_node() to determine the
fallback node

[2/3] slub: fallback to node_to_mem_node() node if allocating on
memoryless node

- Joonsoo's patches to cache the nearest node with memory for each
NUMA node

[3/3] Partial revert of 81c98869faa5 (""kthread: ensure locality of
task_struct allocations")

- At Tejun's request, keep the knowledge of memoryless node fallback
to the allocator core.

This patch (of 3):

We need to determine the fallback node in slub allocator if the allocation
target node is memoryless node. Without it, the SLUB wrongly select the
node which has no memory and can't use a partial slab, because of node
mismatch. Introduced function, node_to_mem_node(X), will return a node Y
with memory that has the nearest distance. If X is memoryless node, it
will return nearest distance node, but, if X is normal node, it will
return itself.

We will use this function in following patch to determine the fallback
node.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# abe5f972 02-Oct-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: fix zone allocation fairness on UP

The zone allocation batches can easily underflow due to higher-order
allocations or spills to remote nodes. On SMP that's fine, because
underflows are expected from concurrency and dealt with by returning 0.
But on UP, zone_page_state will just return a wrapped unsigned long,
which will get past the <= 0 check and then consider the zone eligible
until its watermarks are hit.

Commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before
waking kswapd") already made the counter-resetting use
atomic_long_read() to accomodate underflows from remote spills, but it
didn't go all the way with it.

Make it clear that these batches are expected to go negative regardless
of concurrency, and use atomic_long_read() everywhere.

Fixes: 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy")
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8b375f64 22-Aug-2014 Luiz Capitulino <lcapitulino@redhat.com>

x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()

The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.

However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
setup_node_data() is not used.

This causes a small glitch when running Linux as a hyperv numa
guest:

SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
Initmem setup node 0 [mem 0x00000000-0x7fffffff]
NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
Initmem setup node 1 [mem 0x80800000-0x1081fffff]
NODE_DATA [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7

In this dmesg, the SRAT table reports that the memory range for
node 1 starts at 0x80200000. However, the line starting with
"Initmem" reports that node 1 memory range starts at 0x80800000.
The "Initmem" line is reported by setup_node_data() and is
wrong, because the kernel ends up using the range as reported in
the SRAT table.

This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.

Here's the same dmesg section with this patch applied:

SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
Initmem setup node 1 [mem 0x80200000-0x1081fffff]
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7

This commit was tested on a two node bare-metal NUMA machine and
Linux as a numa guest on hyperv and qemu/kvm.

PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8fe78048 06-Aug-2014 David Rientjes <rientjes@google.com>

mm, thp: restructure thp avoidance of light synchronous migration

__GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
or not, has gained more users. Their use is not necessarily wrong, they
are trying to do a memory allocation that can easily fail without
disturbing kswapd, so the bit has gained additional usecases.

This restructures the check to determine whether MIGRATE_SYNC_LIGHT
should be used for memory compaction in the page allocator. Rather than
testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
for thp allocations.

This also moves the check to be done only after the page allocator is
aborted for deferred or contended memory compaction since setting
migration_mode for this case is pointless.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e972a070 06-Aug-2014 David Rientjes <rientjes@google.com>

mm, oom: rename zonelist locking functions

try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
to imply that they require locking semantics to avoid out_of_memory()
being reordered.

zone_scan_lock is required for both functions to ensure that there is
proper locking synchronization.

Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
locking semantics.

At the same time, convert oom_zonelist_trylock() to return bool instead
of int since only success and failure are tested.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ffeaf35 06-Aug-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: reduce cost of the fair zone allocation policy

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly
the same time and the resets each time are justified. This assumption
does not hold when zones reach their low watermark as the batches will
be consumed at uneven rates. Allocation failure due to watermark
depletion result in additional zonelist scans for the reset and another
watermark check before hitting the slowpath.

On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes. The system CPU overhead comparison looks like

3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanilla vmstat-v5 lowercost-v5
User 746.94 774.56 802.00
System 65336.22 32847.27 40852.33
Elapsed 27553.52 27415.04 27368.46

However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7b5d647 06-Aug-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: abort fair zone allocation policy when remotes nodes are encountered

The purpose of numa_zonelist_order=zone is to preserve lower zones for
use with 32-bit devices. If locality is preferred then the
numa_zonelist_order=node policy should be used.

Unfortunately, the fair zone allocation policy overrides this by
skipping zones on remote nodes until the lower one is found. While this
makes sense from a page aging and performance perspective, it breaks the
expected zonelist policy. This patch restores the expected behaviour
for zone-list ordering.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d5d823a 06-Aug-2014 Mel Gorman <mgorman@suse.de>

mm: move zone->pages_scanned into a vmstat counter

zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.

On a small UMA machine running tiobench the difference is marginal. On
a 4-node machine the overhead is more noticable. Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.

3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5 vmstat-v5
User 746.94 759.78 774.56
System 65336.22 58350.98 32847.27
Elapsed 27553.52 27282.02 27415.04

Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3484b2de 06-Aug-2014 Mel Gorman <mgorman@suse.de>

mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines

The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example

o The zone->node field is shared with the zone lock and zone->node is
accessed frequently from the page allocator due to the fair zone
allocation policy.

o span_seqlock is almost never used by shares a line with free_area

o Some zone statistics share a cache line with the LRU lock so
reclaim-intensive and allocator-intensive workloads can bounce the cache
line on a stat update

This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines. Note that the type of lowmem_reserve changes due to the
watermark calculations being signed and avoiding a signed/unsigned
conversion there.

On the test configuration I used the overall size of struct zone shrunk
by one cache line. On smaller machines, this is not likely to be
noticable. However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.

3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5r9
User 746.94 759.78
System 65336.22 58350.98
Elapsed 27553.52 27282.02

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc7452b6 06-Aug-2014 Rafael Aquini <aquini@redhat.com>

mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces

Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.

This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7be12fc9 06-Aug-2014 Michal Nazarewicz <mina86@mina86.com>

mm: page_alloc: simplify drain_zone_pages by using min()

Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change
type of batch local variable to match type of per_cpu_pages::batch
(which is int).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b95b4e1e 06-Aug-2014 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: unexport alloc_pages_exact_nid()

It is only called by mm/page_cgroup.c whcih cannot be modular.

Reported-by: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1931811 06-Aug-2014 Fabian Frederick <fabf@skynet.be>

mm/page_alloc.c: add __meminit to alloc_pages_exact_nid()

alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b104a35d 30-Jul-2014 David Rientjes <rientjes@google.com>

mm, thp: do not allow thp faults to avoid cpuset restrictions

The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.

Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.

This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1aab4d77 27-Jul-2014 Randy Dunlap <rdunlap@infradead.org>

mm: fix page_alloc.c kernel-doc warnings

Fix kernel-doc warnings and function name in mm/page_alloc.c:

Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc78327c 02-Jul-2014 Michal Nazarewicz <mina86@mina86.com>

mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER

With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
the following is triggered at early boot:

SMP: Total of 8 processors activated.
devtmpfs: initialized
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = fffffe0000050000
[00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
PC is at __list_add+0x10/0xd4
LR is at free_one_page+0x270/0x638
...
Call trace:
__list_add+0x10/0xd4
free_one_page+0x26c/0x638
__free_pages_ok.part.52+0x84/0xbc
__free_pages+0x74/0xbc
init_cma_reserved_pageblock+0xe8/0x104
cma_init_reserved_areas+0x190/0x1e4
do_one_initcall+0xc4/0x154
kernel_init_freeable+0x204/0x2a8
kernel_init+0xc/0xd4

This happens because init_cma_reserved_pageblock() calls
__free_one_page() with pageblock_order as page order but it is bigger
than MAX_ORDER. This in turn causes accesses past zone->free_list[].

Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.

In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures expect for ia64, powerpc and tile at the moment, the
“pageblock_order > MAX_ORDER” condition will be optimised out since both
sides of the operator are constants. In cases where pageblock size is
variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Mark Salter <msalter@redhat.com>
Tested-by: Christopher Covington <cov@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7cd2b0a3 23-Jun-2014 David Rientjes <rientjes@google.com>

mm, pcp: allow restoring percpu_pagelist_fraction default

Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:

divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6

However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.

This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.

This successfully solves the divide by zero issue at the same time.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cccad5b9 06-Jun-2014 Joe Perches <joe@perches.com>

mm: convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d018176 04-Jun-2014 Zhang Zhen <zhenzhang.zhang@huawei.com>

mm/page_alloc.c: cleanup add_active_range() related comments

add_active_range() has been repalced by memblock_set_node(). Clean up the
comments to comply with that change.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8846374 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: calculate classzone_idx once from the zonelist ref

There is no need to calculate zone_idx(preferred_zone) multiple times
or use the pgdat to figure it out.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b745bc85 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: convert hot/cold parameter and immediate callers to bool

cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7aeb09f9 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: use unsigned int for order in more places

X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type if used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfc47a28 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free

get_pageblock_migratetype() is called during free with IRQs disabled.
This is unnecessary and disables IRQs for longer than necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc4b0caf 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: reduce number of times page_to_pfn is called

In the free path we calculate page_to_pfn multiple times. Reduce that.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e58469ba 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: use word-based accesses for get/set pageblock bitmaps

The test_bit operations in get/set pageblock flags are expensive. This
patch reads the bitmap on a word basis and use shifts and masks to isolate
the bits of interest. Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.

In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.

In addition to the performance benefits, this patch closes races that are
possible between:

a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
reads part of the bits before and other part of the bits after
set_pageblock_migratetype() has updated them.

b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
read-modify-update set bit operation in set_pageblock_skip() will cause
lost updates to some bits changed in the set_pageblock_migratetype().

Joonsoo Kim first reported the case a) via code inspection. Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock). Furthermore during development of
unrelated compaction patches, it was observed that frequent calls to
{start,undo}_isolate_page_range() the race occurs several thousands of
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g. list_move(). Further debugging confirmed that
migratetype had invalid value of 6, causing out of bounds access to the
free_list array.

That confirmed that the race exist, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove. Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.

After this patch, the race has no longer been observed in testing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5dab2911 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path

ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
unlikely branch in the fast path. This patch moves the check out of the
fast path and after it has been determined that the watermarks have not
been met. This helps the common fast path at the cost of making the slow
path slower and hitting kswapd with a performance cost. It's a reasonable
tradeoff.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6e21b14 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: only check the alloc flags and gfp_mask for dirty once

Currently it's calculated once per zone in the zonelist.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d34c5fa0 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: only check the zone id check if pages are buddies

A node/zone index is used to check if pages are compatible for merging
but this happens unconditionally even if the buddy page is not free. Defer
the calculation as long as possible. Ideally we would check the zone boundary
but nodes can overlap.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 664eedde 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: use jump labels to avoid checking number_of_cpusets

If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 800a1e75 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full"

If a zone cannot be used for a dirty page then it gets marked "full" which
is cached in the zlc and later potentially skipped by allocation requests
that have nothing to do with dirty zones.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65bb3719 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: do not update zlc unless the zlc is active

The zlc is used on NUMA machines to quickly skip over zones that are full.
However it is always updated, even for the first zone scanned when the
zlc might not even be active. As it's a write to a bitmap that
potentially bounces cache line it's deceptively expensive and most
machines will not care. Only update the zlc if it was active.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75f30861 04-Jun-2014 David Rientjes <rientjes@google.com>

mm, thp: avoid excessive compaction latency during fault

Synchronous memory compaction can be very expensive: it can iterate an
enormous amount of memory without aborting, constantly rescheduling,
waiting on page locks and lru_lock, etc, if a pageblock cannot be
defragmented.

Unfortunately, it's too expensive for transparent hugepage page faults and
it's much better to simply fallback to pages. On 128GB machines, we find
that synchronous memory compaction can take O(seconds) for a single thp
fault.

Now that async compaction remembers where it left off without strictly
relying on sync compaction, this makes thp allocations best-effort without
causing egregious latency during fault. We still need to retry async
compaction after reclaim, but this won't stall for seconds.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0b9daeb 04-Jun-2014 David Rientjes <rientjes@google.com>

mm, compaction: embed migration mode in compact_control

We're going to want to manipulate the migration mode for compaction in the
page allocator, and currently compact_control's sync field is only a bool.

Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
depending on the value of this bool. Convert the bool to enum
migrate_mode and pass the migration mode in directly. Later, we'll want
to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
avoid unnecessary latency.

This also alters compaction triggered from sysfs, either for the entire
system or for a node, to force MIGRATE_SYNC.

[akpm@linux-foundation.org: fix build]
[iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68711a74 04-Jun-2014 David Rientjes <rientjes@google.com>

mm, migration: add destination page freeing callback

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 613813e8 04-Jun-2014 Dave Hansen <dave.hansen@linux.intel.com>

mm: debug: make bad_range() output more usable and readable

Nobody outputs memory addresses in decimal. PFNs are essentially
addresses, and they're gibberish in decimal. Output them in hex.

Also, add the nid and zone name to give a little more context to the
message.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5bcc9f86 04-Jun-2014 Vlastimil Babka <vbabka@suse.cz>

mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced

For the MIGRATE_RESERVE pages, it is useful when they do not get
misplaced on free_list of other migratetype, otherwise they might get
allocated prematurely and e.g. fragment the MIGRATE_RESEVE pageblocks.
While this cannot be avoided completely when allocating new
MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
prevent the misplacement where possible.

Currently, it is possible for the misplacement to happen when a
MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
fallback for other desired migratetype, and then later freed back
through free_pcppages_bulk() without being actually used. This happens
because free_pcppages_bulk() uses get_freepage_migratetype() to choose
the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
the *desired* migratetype and not the page's original MIGRATE_RESERVE
migratetype.

This patch fixes the problem by moving the call to
set_freepage_migratetype() from rmqueue_bulk() down to
__rmqueue_smallest() and __rmqueue_fallback() where the actual page's
migratetype (e.g. from which free_list the page is taken from) is used.
Note that this migratetype might be different from the pageblock's
migratetype due to freepage stealing decisions. This is OK, as page
stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
to leave all MIGRATE_CMA pages on the correct freelist.

Therefore, as an additional benefit, the call to
get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
be removed completely. This relies on the fact that MIGRATE_CMA
pageblocks are created only during system init, and the above. The
related is_migrate_isolate() check is also unnecessary, as memory
isolation has other ways to move pages between freelists, and drain pcp
lists containing pages that should be isolated. The buffered_rmqueue()
can also benefit from calling get_freepage_migratetype() instead of
get_pageblock_migratetype().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f7a75ac 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: page_alloc: do not cache reclaim distances

pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
by zone_reclaim due to its distance. As it is expected that
zone_reclaim_mode will be rarely enabled it is unreasonable for all
machines to take a penalty. Fortunately, the zone_reclaim_mode() path
is already slow and it is the path that takes the hit.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f9b16a6 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: disable zone_reclaim_mode by default

When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.

Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes. This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes. The NUMA penalties were
sufficiently high to justify reclaiming the memory. On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this. Favour the
common case and disable it by default. Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52383431 04-Jun-2014 Vladimir Davydov <vdavydov.dev@gmail.com>

mm: get rid of __GFP_KMEMCG

Currently to allocate a page that should be charged to kmemcg (e.g.
threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
allocated is then to be freed by free_memcg_kmem_pages. Apart from
looking asymmetrical, this also requires intrusion to the general
allocation path. So let's introduce separate functions that will
alloc/free pages charged to kmemcg.

The new functions are called alloc_kmem_pages and free_kmem_pages. They
should be used when the caller actually would like to use kmalloc, but
has to fall back to the page allocator for the allocation is large.
They only differ from alloc_pages and free_pages in that besides
allocating or freeing pages they also charge them to the kmem resource
counter of the current memory cgroup.

[sfr@canb.auug.org.au: export kmalloc_order() to modules]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed12d845 07-Apr-2014 John Hubbard <jhubbard@nvidia.com>

mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL

A new dump_page() routine was recently added, and marked
EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
macro, and so the end result is that non-GPL code can no longer call
get_page() and a few other routines.

This only happens if the kernel was compiled with CONFIG_DEBUG_VM.

Change dump_page() to be EXPORT_SYMBOL.

Longer explanation:

Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
using VM_BUG_ON_PAGE") , it was possible to build MIT-licensed (non-GPL)
drivers on Fedora. Fedora is semi-unique, in that it sets
CONFIG_VM_DEBUG.

Because Fedora sets CONFIG_VM_DEBUG, they end up pulling in dump_page(),
via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
new, open source, "UVM-Lite" kernel module, I originally choose to use
the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
because get_page() has always seemed to be very clearly intended for use
by non-GPL, driver code.

So I'm hoping that making get_page() widely accessible again will not be
too controversial. We did check with Fedora first, and they responded
(https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
try to get upstream changed, before asking Fedora to change. Their
reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
Fedora to help catch mm bugs.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 136199f0 07-Apr-2014 Emil Medve <Emilian.Medve@freescale.com>

memblock: use for_each_memblock()

This is a small cleanup.

Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3a025760 07-Apr-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: spill to remote nodes before waking kswapd

On NUMA systems, a node may start thrashing cache or even swap anonymous
pages while there are still free pages on remote nodes.

This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
of fair allocation policy").

Before those changes, the allocator would first try all allowed zones,
including those on remote nodes, before waking any kswapds. But now,
the allocator fastpath doubles as the fairness pass, which in turn can
only consider the local node to prevent remote spilling based on
exhausted fairness batches alone. Remote nodes are only considered in
the slowpath, after the kswapds are woken up. But if remote nodes still
have free memory, kswapd should not be woken to rebalance the local node
or it may thrash cash or swap prematurely.

Fix this by adding one more unfair pass over the zonelist that is
allowed to spill to remote nodes after the local fairness pass fails but
before entering the slowpath and waking the kswapds.

This also gets rid of the GFP_THISNODE exemption from the fairness
protocol because the unfair pass is no longer tied to kswapd, which
GFP_THISNODE is not allowed to wake up.

However, because remote spills can be more frequent now - we prefer them
over local kswapd reclaim - the allocation batches on remote nodes could
underflow more heavily. When resetting the batches, use
atomic_long_read() directly instead of zone_page_state() to calculate the
delta as the latter filters negative counter values.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> [3.12+]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d230dec1 07-Apr-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: use 'const char *' insted of 'char *' for reason in dump_page()

I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
warning:

warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]

Let's convert 'reason' to 'const char *' in dump_page() and friends: we
shouldn't modify it anyway.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 70ef57e6 07-Apr-2014 Michal Hocko <mhocko@suse.cz>

mm: exclude memoryless nodes from zone_reclaim

We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and a tons of anonymous memory
which could be swapped out. In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM. Although this sounds like a bug
somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction numactl on the said hardware suggests that the zone reclaim
should not have been set in the first place:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10

So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory. Node distances cause an
automatic zone_reclaim_mode enabling.

Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes. So let's exclude such nodes for
init_zone_allows_reclaim which evaluates zone reclaim behavior and
suitable reclaim_nodes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d26914d1 03-Apr-2014 Mel Gorman <mgorman@suse.de>

mm: optimize put_mems_allowed() usage

Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.

Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27329369 03-Mar-2014 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness

Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free, and
bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
zone allocator policy").

The problem is that GFP_THISNODE obeys the zone fairness allocation
batches on one hand, but doesn't reset them and wake kswapd on the other
hand. After a few of those allocations, the batches are exhausted and
the allocations fail.

Fixing this means either having GFP_THISNODE wake up kswapd, or
GFP_THISNODE not participating in zone fairness at all. The latter
seems safer as an acute bugfix, we can clean up later.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 668f9abb 03-Mar-2014 David Rientjes <rientjes@google.com>

mm: close PageTail race

Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42aa83cb 23-Jan-2014 Han Pingtian <hanpt@linux.vnet.ibm.com>

mm: show message when updating min_free_kbytes in thp

min_free_kbytes may be raised during THP's initialization. Sometimes,
this will change the value which was set by the user. Showing this
message will clarify this confusion.

Only show this message when changing a value which was set by the user
according to Michal Hocko's suggestion.

Show the old value of min_free_kbytes according to Dave Hansen's
suggestion. This will give user the chance to restore old value of
min_free_kbytes.

Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da8c757b 23-Jan-2014 Han Pingtian <hanpt@linux.vnet.ibm.com>

mm: prevent setting of a value less than 0 to min_free_kbytes

If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing
proc_dointvec() to proc_dointvec_minmax() in the
min_free_kbytes_sysctl_handler() can prevent this to happen.

mhocko said:

: You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
: your machine unusable but I agree that proc_dointvec_minmax is more
: suitable here as we already have:
:
: .proc_handler = min_free_kbytes_sysctl_handler,
: .extra1 = &zero,
:
: It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
: to ctl_name and strategy from the generic sysctl table") has removed
: sysctl_intvec strategy and so extra1 is ignored.

Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 309381fea 23-Jan-2014 Sasha Levin <sasha.levin@oracle.com>

mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE

Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.

I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.

[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0b791a3 23-Jan-2014 Dave Hansen <dave@sr71.net>

mm: print more details for bad_page()

bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether ->index or ->mapping is required to be NULL.

This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.

This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.

[akpm@linux-foundation.org: switch to pr_alert]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aed0a0e3 21-Jan-2014 David Rientjes <rientjes@google.com>

mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure

__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.

Luckily, nothing currently does such craziness. So instead of causing
such allocations to loop (potentially forever), we maintain the current
behavior and also warn about the new users of the deprecated flag.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de6c60a6 21-Jan-2014 Vlastimil Babka <vbabka@suse.cz>

mm: compaction: encapsulate defer reset logic

Currently there are several functions to manipulate the deferred
compaction state variables. The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter is reset
completely.

Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code. No functional
change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6782832e 21-Jan-2014 Santosh Shilimkar <santosh.shilimkar@ti.com>

mm/page_alloc.c: use memblock apis for early memory allocations

Switch to memblock interfaces for early memory allocator instead of
bootmem allocator. No functional change in beahvior than what it is in
current code from bootmem users points of view.

Archs already converted to NO_BOOTMEM now directly use memblock
interfaces instead of bootmem wrappers build on top of memblock. And
the archs which still uses bootmem, these new apis just fallback to
exiting bootmem APIs.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2f3eebe 21-Jan-2014 Tang Chen <tangchen@cn.fujitsu.com>

x86, numa, acpi, memory-hotplug: make movable_node have higher priority

If users specify the original movablecore=nn@ss boot option, the kernel
will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
option is similar except it specifies ZONE_NORMAL ranges.

Now, if users specify "movable_node" in kernel commandline, the kernel
will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users
do this, all the other movablecore=nn@ss and kernelcore=nn@ss options
should be ignored.

For those who don't want this, just specify nothing. The kernel will
act as before.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aec6a888 21-Jan-2014 Mel Gorman <mgorman@suse.de>

mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT

Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in
non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm:
do not walk all of system memory during show_mem") avoided a PFN walk in
the generic show_mem helper which removes the requirement for
SHOW_MEM_FILTER_PAGE_COUNT in that case.

This patch removes PFN walkers from the arch-specific implementations
that report on a per-node or per-zone granularity. ARM and unicore32
still do a PFN walk as they report memory usage on each bank which is a
much finer granularity where the debugging information may still be of
use. As the remaining arches doing PFN walks have relatively small
amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.

[akpm@linux-foundation.org: fix parisc]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: James Bottomley <jejb@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 943dca1a 21-Jan-2014 Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve

Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
9TB memory machine since onlining memory sections is too slow. And we
found out setup_zone_migrate_reserve spent >90% of the time.

The problem is, setup_zone_migrate_reserve scans all pageblocks
unconditionally, but it is only necessary if the number of reserved
block was reduced (i.e. memory hot remove).

Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
that the number of reserved pageblocks is almost always unchanged.

This patch adds zone->nr_migrate_reserve_block to maintain the number of
MIGRATE_RESERVE pageblocks and it reduces the overhead of
setup_zone_migrate_reserve dramatically. The following table shows time
of onlining a memory section.

Amount of memory | 128GB | 192GB | 256GB|
---------------------------------------------
linux-3.12 | 23.9 | 31.4 | 44.5 |
This patch | 8.3 | 8.3 | 8.6 |
Mel's proposal patch | 10.9 | 19.2 | 31.3 |
---------------------------------------------
(millisecond)

128GB : 4 nodes and each node has 32GB of memory
192GB : 6 nodes and each node has 32GB of memory
256GB : 8 nodes and each node has 32GB of memory

(*1) Mel proposed his idea by the following threads.
https://lkml.org/lkml/2013/10/30/272

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fff4068c 20-Dec-2013 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: revert NUMA aspect of fair allocation policy

Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
to bring aging fairness among zones in system, but it was overzealous
and badly regressed basic workloads on NUMA systems.

Due to the way kswapd and page allocator interacts, we still want to
make sure that all zones in any given node are used equally for all
allocations to maximize memory utilization and prevent thrashing on the
highest zone in the node.

While the same principle applies to NUMA nodes - memory utilization is
obviously improved by spreading allocations throughout all nodes -
remote references can be costly and so many workloads prefer locality
over memory utilization. The original change assumed that
zone_reclaim_mode would be a good enough predictor for that, but it
turned out to be as indicative as a coin flip.

Revert the NUMA aspect of the fairness until we can find a proper way to
make it configurable and agree on a sane default.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> # 3.12
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8798cee2 20-Dec-2013 Mel Gorman <mgorman@suse.de>

Revert "mm: page_alloc: exclude unreclaimable allocations from zone fairness policy"

This reverts commit 73f038b863df. The NUMA behaviour of this patch is
less than ideal. An alternative approch is to interleave allocations
only within local zones which is implemented in the next patch.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73f038b8 18-Dec-2013 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: exclude unreclaimable allocations from zone fairness policy

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").
That change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator and
slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone. It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page reclaim
or slab shrinking.

Bisected by Dave Hansen.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1aeb65a 12-Nov-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

mm/page_alloc.c: fix comment in zlc_setup()

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0cbef29a 12-Nov-2013 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: __rmqueue_fallback() should respect pageblock type

When __rmqueue_fallback() doesn't find a free block with the required size
it splits a larger page and puts the rest of the page onto the free list.

But it has one serious mistake. When putting back, __rmqueue_fallback()
always use start_migratetype if type is not CMA. However,
__rmqueue_fallback() is only called when all of the start_migratetype
queue is empty. That said, __rmqueue_fallback always puts back memory to
the wrong queue except try_to_steal_freepages() changed pageblock type
(i.e. requested size is smaller than half of page block). The end result
is that the antifragmentation framework increases fragmenation instead of
decreasing it.

Mel's original anti fragmentation does the right thing. But commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

This patch restores sane and old behavior. It also removes an incorrect
comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
restructure free-page stealing code and fix a bug").

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52c8f6a5 12-Nov-2013 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()

In general, every tracepoint should be zero overhead if it is disabled.
However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate
"new_type == start_migratetype" even if tracepoint is disabled.

However, the code can be moved into tracepoint's TP_fast_assign() and
TP_fast_assign exist exactly such purpose. This patch does it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d0f3f72 12-Nov-2013 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: fix page_group_by_mobility_disabled breakage

Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
the argument to MIGRATE_UNMOVABLE and we lost these attribute.

The problem was introduced by commit 49255c619fbd ("page allocator: move
check for disabled anti-fragmentation out of fastpath"). So a 4 year
old issue may mean that nobody uses page_group_by_mobility_disabled.

But anyway, this patch fixes the problem.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfc4f9d5 12-Nov-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm/page_alloc.c: remove unused marco LONG_ALIGN

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9921ecd 12-Nov-2013 Qiang Huang <h.huangqiang@huawei.com>

mm: add a helper function to check may oom condition

Use helper function to check if we need to deal with oom condition.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b38a8725 12-Nov-2013 Xishi Qiu <qiuxishi@huawei.com>

mm: use populated_zone() instead of if(zone->present_pages)

Use "if (zone->present_pages)" instead of "if (zone->present_pages)".
Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90572890 07-Oct-2013 Peter Zijlstra <peterz@infradead.org>

mm: numa: Change page last {nid,pid} into {cpu,pid}

Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and lookup the alternate task more
easily. Note that even though it is the cpu that is store in the page
flags that the mpol_misplaced decision is still based on the node.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
[ Fixed build failure on 32-bit systems. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b795854b 07-Oct-2013 Mel Gorman <mgorman@suse.de>

sched/numa: Set preferred NUMA node based on number of private faults

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix complication error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7393dc45 30-Sep-2013 Joonyoung Shim <jy0922.shim@samsung.com>

revert "mm/memory-hotplug: fix lowmem count overflow when offline pages"

This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
overflow when offline pages").

The fixed bug by commit cea27eb was fixed to another way by commit
3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That commit
enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
memory, for details please refer to:

http://marc.info/?l=linux-mm&m=136957578620221&w=2

As a result, commit cea27eb2a202 currently causes duplicated decreasing
of totalhigh_pages, thus the revert.

Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf6fe945 11-Sep-2013 Wang Sheng-Hui <shhuiw@gmail.com>

mm: correct the comment about the value for buddy _mapcount

Set _mapcount PAGE_BUDDY_MAPCOUNT_VALUE to make the page buddy. Not the
magic number -2.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e543d57 11-Sep-2013 Lisa Du <cldu@marvell.com>

mm: vmscan: fix do_try_to_free_pages() livelock

This patch is based on KOSAKI's work and I add a little more description,
please refer https://lkml.org/lkml/2012/6/14/74.

Currently, I found system can enter a state that there are lots of free
pages in a zone but only order-0 and order-1 pages which means the zone is
heavily fragmented, then high order allocation could make direct reclaim
path's long stall(ex, 60 seconds) especially in no swap and no compaciton
enviroment. This problem happened on v3.4, but it seems issue still lives
in current tree, the reason is do_try_to_free_pages enter live lock:

kswapd will go to sleep if the zones have been fully scanned and are still
not balanced. As kswapd thinks there's little point trying all over again
to avoid infinite loop. Instead it changes order from high-order to
0-order because kswapd think order-0 is the most important. Look at
73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep
and may leave zone->all_unreclaimable =3D 0. It assume high-order users
can still perform direct reclaim if they wish.

Direct reclaim continue to reclaim for a high order which is not a
COSTLY_ORDER without oom-killer until kswapd turn on
zone->all_unreclaimble= . This is because to avoid too early oom-kill.
So it means direct_reclaim depends on kswapd to break this loop.

In worst case, direct-reclaim may continue to page reclaim forever when
kswapd sleeps forever until someone like watchdog detect and finally kill
the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737

We can't turn on zone->all_unreclaimable from direct reclaim path because
direct reclaim path don't take any lock and this way is racy. Thus this
patch removes zone->all_unreclaimable field completely and recalculates
zone reclaimable state every time.

Note: we can't take the idea that direct-reclaim see zone->pages_scanned
directly and kswapd continue to use zone->all_unreclaimable. Because, it
is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the detail.

[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b11f0aa 11-Sep-2013 SeungHun Lee <waydi1@gmail.com>

mm: page_alloc: fix comment get_page_from_freelist

cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the
comment is moved to __cpuset_node_allowed_softwall. So fix this comment.

Signed-off-by: SeungHun Lee <waydi1@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e76b63f8 11-Sep-2013 Yinghai Lu <yinghai@kernel.org>

memblock, numa: binary search node id

Current early_pfn_to_nid() on arch that support memblock go over
memblock.memory one by one, so will take too many try near the end.

We can use existing memblock_search to find the node id for given pfn,
that could save some time on bigger system that have many entries
memblock.memory array.

Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().

3.11-rc5 with patch difference (%)
-------- ---------- --------------
UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%)
UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%)
UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%)
UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%)
Time in seconds.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8721bbb 11-Sep-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

mm: memory-hotplug: enable memory hotplug to handle hugepage

Until now we can't offline memory blocks which contain hugepages because a
hugepage is considered as an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.

What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block intervene the memory offlining. For this reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().

Other than that, what this patch does is straightforwardly to add hugepage
migration code, that is, adding hugepage code to the functions which scan
over pfn and collect hugepages to be migrated, and adding a hugepage
allocation function to alloc_migrate_target().

As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
over them because it's larger than memory block. So we now simply leave
it to fail as it is.

[yongjun_wei@trendmicro.com.cn: remove duplicated include]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8080fc03 11-Sep-2013 Xishi Qiu <qiuxishi@huawei.com>

mm: use zone_is_empty() instead of if(zone->spanned_pages)

Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2bb921e5 11-Sep-2013 Christoph Lameter <cl@linux.com>

vmstat: create separate function to fold per cpu diffs into local counters

The main idea behind this patchset is to reduce the vmstat update overhead
by avoiding interrupt enable/disable and the use of per cpu atomics.

This patch (of 3):

It is better to have a separate folding function because
refresh_cpu_vm_stats() also does other things like expire pages in the
page allocator caches.

If we have a separate function then refresh_cpu_vm_stats() is only called
from the local cpu which allows additional optimizations.

The folding function is only called when a cpu is being downed and
therefore no other processor will be accessing the counters. Also
simplifies synchronization.

[akpm@linux-foundation.org: fix UP build]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e66f0972 11-Sep-2013 Joonsoo Kim <iamjoonsoo.kim@lge.com>

mm, page_alloc: add unlikely macro to help compiler optimization

We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow
path. For helping compiler optimization, add unlikely macro to
ALLOC_NO_WATERMARKS checking.

This patch doesn't have any effect now, because gcc already optimize this
properly. But we cannot assume that gcc always does right and nobody
re-evaluate if gcc do proper optimization with their change, for example,
it is not optimized properly on v3.10. So adding compiler hint here is
reasonable.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81c0a2bb 11-Sep-2013 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: fair zone allocator policy

Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.

But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:

$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021

Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:

$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937

When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e085dbc5 11-Sep-2013 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: rearrange watermark checking in get_page_from_freelist

Allocations that do not have to respect the watermarks are rare
high-priority events. Reorder the code such that per-zone dirty limits
and future checks important only to regular page allocations are ignored
in these extraordinary situations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2d0bd2b 11-Sep-2013 Yinghai Lu <yinghai@kernel.org>

mm: kill one if loop in __free_pages_bootmem()

We should not check loop+1 with loop end in loop body. Just duplicate two
lines code to avoid it.

That will help a bit when we have huge amount of pages on system with
16TiB memory.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f92310c1 11-Sep-2013 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint()

In the current code, the value of fallback_migratetype that is printed
using the mm_page_alloc_extfrag tracepoint, is the value of the
migratetype *after* it has been set to the preferred migratetype (if the
ownership was changed). Obviously that wouldn't have been the original
intent. (We already have a separate 'change_ownership' field to tell
whether the ownership of the pageblock was changed from the
fallback_migratetype to the preferred type.)

The intent of the fallback_migratetype field is to show the migratetype
from which we borrowed pages in order to satisfy the allocation request.
So fix the code to print that value correctly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fef903ef 11-Sep-2013 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

mm/page_allo.c: restructure free-page stealing code and fix a bug

The free-page stealing code in __rmqueue_fallback() is somewhat hard to
follow, and has an incredible amount of subtlety hidden inside!

First off, there is a minor bug in the reporting of change-of-ownership of
pageblocks. Under some conditions, we try to move upto
'pageblock_nr_pages' no. of pages to the preferred allocation list. But
we change the ownership of that pageblock to the preferred type only if we
manage to successfully move atleast half of that pageblock (or if
page_group_by_mobility_disabled is set).

However, the current code ignores the latter part and sets the
'migratetype' variable to the preferred type, irrespective of whether we
actually changed the pageblock migratetype of that block or not. So, the
page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
'change_ownership' might be shown as 1 when it must have been 0).

So fixing this involves moving the update of the 'migratetype' variable to
the right place. But looking closer, we observe that the 'migratetype'
variable is used subsequently for checks such as "is_migrate_cma()".
Obviously the intent there is to check if the *fallback* type is
MIGRATE_CMA, but since we already set the 'migratetype' variable to
start_migratetype, we end up checking if the *preferred* type is
MIGRATE_CMA!!

To make things more interesting, this actually doesn't cause a bug in
practice, because we never change *anything* if the fallback type is CMA.

So, restructure the code in such a way that it is trivial to understand
what is going on, and also fix the above mentioned bug. And while at it,
also add a comment explaining the subtlety behind the migratetype used in
the call to expand().

[akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8af2941 11-Sep-2013 Pintu Kumar <pintu.k@samsung.com>

mm/page_alloc.c: fix coding style and spelling

Fix all errors reported by checkpatch and some small spelling mistakes.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15ca220e 11-Sep-2013 Chen Gang <gang.chen@asianux.com>

mm/page_alloc.c: use '__paginginit' instead of '__init'

set_pageblock_order() may be called when memory hotplug, so need use
'__paginginit' instead of '__init'.

The related warning:

The function __meminit .free_area_init_node() references
a function __init .set_pageblock_order().
If .set_pageblock_order is only used by .free_area_init_node then
annotate .set_pageblock_order with a matching annotation.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7e83318 11-Sep-2013 Jerry Zhou <uulinux@gmail.com>

mm: fix negative left shift count when PAGE_SHIFT > 20

When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative. The
previous calculating here will generate an unexpected result. In
addition, if PAGE_SIZE >= 1MB, The memory size of "numentries" was
already integral multiple of 1MB.

Signed-off-by: Jerry Zhou <uulinux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9cf510a5 22-Aug-2013 Li Zhong <zhong@linux.vnet.ibm.com>

Fix comment typo for init_cma_reserved_pageblock

It seems the "it's" should be "its" here.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 5f12733e 08-Jul-2013 Michal Hocko <mhocko@suse.cz>

mm: honor min_free_kbytes set by user

min_free_kbytes is updated during memory hotplug (by
init_per_zone_wmark_min) currently which is right thing to do in most
cases but this could be unexpected if admin increased the value to
prevent from allocation failures and the new min_free_kbytes would be
decreased as a result of memory hotadd.

This patch saves the user defined value and allows updating
min_free_kbytes only if it is higher than the saved one.

A warning is printed when the new value is ignored.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 345606d4 08-Jul-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm/page_alloc.c: remove unlikely() from the current_order test

In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
the order passed. MAX_ORDER is typically 11 and pageblock_order is
typically 9 on x86. Integer division truncates, so pageblock_order / 2
is 4. For the first eight iterations, it's guaranteed that
current_order >= pageblock_order / 2 if it even gets that far!

So just remove the unlikely(), it's completely bogus.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc732f1d 08-Jul-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm/page_alloc.c: remove zone_type argument of build_zonelists_node

The callers of build_zonelists_node always pass MAX_NR_ZONES -1 as the
zone_type argument, so we can directly use the value in
build_zonelists_node and remove zone_type argument.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7960aedd 08-Jul-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm: remove duplicated call of get_pfn_range_for_nid

When calculating pages in a node, for each zone in that node, we will
have

zone_spanned_pages_in_node
--> get_pfn_range_for_nid
zone_absent_pages_in_node
--> get_pfn_range_for_nid

That is to say, we call the get_pfn_range_for_nid to get start_pfn and
end_pfn of the node for MAX_NR_ZONES * 2 times. And this is totally
unnecessary if we call the get_pfn_range_for_nid before
zone_*_pages_in_node add two extra arguments node_start_pfn and
node_end_pfn for zone_*_pages_in_node, then we can remove the
get_pfn_range_in_node in zone_*_pages_in_node.

[akpm@linux-foundation.org: make definitions more readable]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ee3d4e8 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: introduce helper function mem_init_print_info() to simplify mem_init()

Introduce helper function mem_init_print_info() to simplify mem_init()
across different architectures, which also unifies the format and
information printed.

Function mem_init_print_info() calculates memory statistics information
without walking each page, so it should be a little faster on some
architectures.

Also introduce another helper get_num_physpages() to kill the global
variable num_physpages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cdd91a77 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: report available pages as "MemTotal" for each NUMA node

As reported by https://bugzilla.kernel.org/show_bug.cgi?id=53501,
"MemTotal" from /proc/meminfo means memory pages managed by the buddy
system (managed_pages), but "MemTotal" from /sys/.../node/nodex/meminfo
means physical pages present (present_pages) within the NUMA node.
There's a difference between managed_pages and present_pages due to
bootmem allocator and reserved pages.

And Documentation/filesystems/proc.txt says
MemTotal: Total usable ram (i.e. physical ram minus a few reserved
bits and the kernel binary code)

So change /sys/.../node/nodex/meminfo to report available pages within
the node as "MemTotal".

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: <sworddragon2@aol.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3dcc0571 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: correctly update zone->managed_pages

Enhance adjust_managed_page_count() to adjust totalhigh_pages for
highmem pages. And change code which directly adjusts totalram_pages to
use adjust_managed_page_count() because it adjusts totalram_pages,
totalhigh_pages and zone->managed_pages altogether in a safe way.

Remove inc_totalhigh_pages() and dec_totalhigh_pages() from xen/balloon
driver bacause adjust_managed_page_count() has already adjusted
totalhigh_pages.

This patch also fixes two bugs:

1) enhances virtio_balloon driver to adjust totalhigh_pages when
reserve/unreserve pages.
2) enhance memory_hotplug.c to adjust totalhigh_pages when hot-removing
memory.

We still need to deal with modifications of totalram_pages in file
arch/powerpc/platforms/pseries/cmm.c, but need help from PPC experts.

[akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
[akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 170a5a7e 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: make __free_pages_bootmem() only available at boot time

In order to simpilify management of totalram_pages and
zone->managed_pages, make __free_pages_bootmem() only available at boot
time. With this change applied, __free_pages_bootmem() will only be
used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
Other callers of __free_pages_bootmem() have been converted to use
free_reserved_page(), which handles totalram_pages and
zone->managed_pages in a safer way.

This patch also fix a bug in free_pagetable() for x86_64, which should
increase zone->managed_pages instead of zone->present_pages when freeing
reserved pages.

And now we have managed_pages_count_lock to protect totalram_pages and
zone->managed_pages, so remove the redundant ppb_lock lock in
put_page_bootmem(). This greatly simplifies the locking rules.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3d5f5f0 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: use a dedicated lock to protect totalram_pages and zone->managed_pages

Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
protect totalram_pages and zone->managed_pages. Other than the memory
hotplug driver, totalram_pages and zone->managed_pages may also be
modified at runtime by other drivers, such as Xen balloon,
virtio_balloon etc. For those cases, memory hotplug lock is a little
too heavy, so introduce a dedicated lock to protect totalram_pages and
zone->managed_pages.

Now we have a simplified locking rules totalram_pages and
zone->managed_pages as:

1) no locking for read accesses because they are unsigned long.
2) no locking for write accesses at boot time in single-threaded context.
3) serialize write accesses at runtime by acquiring the dedicated
managed_page_count_lock.

Also adjust zone->managed_pages when freeing reserved pages into the
buddy system, to keep totalram_pages and zone->managed_pages in
consistence.

[akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b4b2a0d 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: accurately calculate zone->managed_pages for highmem zones

Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
that all highmem pages will be freed into the buddy system by function
mem_init(). But that's not always true, some architectures may reserve
some highmem pages during boot. For example PPC may allocate highmem
pages for giagant HugeTLB pages, and several architectures have code to
check PageReserved flag to exclude highmem pages allocated during boot
when freeing highmem pages into the buddy system.

So treat highmem pages in the same way as normal pages, that is to:
1) reset zone->managed_pages to zero in mem_init().
2) recalculate managed_pages when freeing pages into the buddy system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f9f4774 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: use managed_pages to calculate default zonelist order

Use zone->managed_pages instead of zone->present_pages to calculate
default zonelist order because managed_pages means allocatable pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 834405c3 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: fix some trivial typos in comments

Fix some trivial typos in comments.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dbe67df4 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: enhance free_reserved_area() to support poisoning memory with zero

Address more review comments from last round of code review.
1) Enhance free_reserved_area() to support poisoning freed memory with
pattern '0'. This could be used to get rid of poison_init_mem()
on ARM64.
2) A previous patch has disabled memory poison for initmem on s390
by mistake, so restore to the original behavior.
3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11199692 03-Jul-2013 Jiang Liu <liuj97@gmail.com>

mm: change signature of free_reserved_area() to fix building warnings

Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:

arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
^
In file included from include/linux/mman.h:4:0,
from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
extern unsigned long free_reserved_area(unsigned long start, unsigned long end,

mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
In file included from arch/mips/include/asm/page.h:49:0,
from include/linux/mmzone.h:20,
from include/linux/gfp.h:4,
from include/linux/mm.h:8,
from mm/page_alloc.c:18:
arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
mm/page_alloc.c: In function 'free_area_init_nodes':
mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]

Also address some minor code review comments.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cea27eb2 03-Jul-2013 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/memory-hotplug: fix lowmem count overflow when offline pages

The logic for the memory-remove code fails to correctly account the
Total High Memory when a memory block which contains High Memory is
offlined as shown in the example below. The following patch fixes it.

Before logic memory remove:

MemTotal: 7603740 kB
MemFree: 6329612 kB
Buffers: 94352 kB
Cached: 872008 kB
SwapCached: 0 kB
Active: 626932 kB
Inactive: 519216 kB
Active(anon): 180776 kB
Inactive(anon): 222944 kB
Active(file): 446156 kB
Inactive(file): 296272 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 7294672 kB
HighFree: 5704696 kB
LowTotal: 309068 kB
LowFree: 624916 kB

After logic memory remove:

MemTotal: 7079452 kB
MemFree: 5805976 kB
Buffers: 94372 kB
Cached: 872000 kB
SwapCached: 0 kB
Active: 626936 kB
Inactive: 519236 kB
Active(anon): 180780 kB
Inactive(anon): 222944 kB
Active(file): 446156 kB
Inactive(file): 296292 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 7294672 kB
HighFree: 5181024 kB
LowTotal: 4294752076 kB
LowFree: 624952 kB

[mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [2.6.24+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dacbde09 03-Jul-2013 Chen Gang <gang.chen@asianux.com>

mm/page_alloc.c: add additional checking and return value for the 'table->data'

- check the length of the procfs data before copying it into a fixed
size array.

- when __parse_numa_zonelist_order() fails, save the error code for
return.

- 'char*' --> 'char *' coding style fix

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 169f6c19 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: don't re-init pageset in zone_pcp_update()

When memory hotplug is triggered, we call pageset_init() on
per-cpu-pagesets which both contain pages and are in use, causing both the
leakage of those pages and (potentially) bad behaviour if a page is
allocated from a pageset while it is being cleared.

Avoid this by factoring out pageset_set_high_and_batch() (which contains
all needed logic too set a pageset's ->high and ->batch inrespective of
system state) from zone_pageset_init() and using the new
pageset_set_high_and_batch() instead of zone_pageset_init() in
zone_pcp_update().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3664033c 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 737af4c0 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()

Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0.

Correct this by calling zone_pageset_init(), which chooses the
appropriate ->batch and ->high calculations.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56cef2b8 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd1895e2 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: relocate comment to be directly above code it refers to.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88c90dbc 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22a7f12b 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high

Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a647f38 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()

zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.

This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).

This does change the behavior of zone_pcp_update() as the percpu pagesets
will not be drained when zone_pcp_update() is called (they will end up
being shrunk, not completely drained, later when a 0-order page is freed
in free_hot_cold_page()).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 998d39cb 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE

pcp->batch could change at any point, avoid relying on it being a stable
value.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d7a8fa9 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.

This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch. It avoids ->batch ever rising above ->high.

Suggested by Gilad Ben-Yossef in these threads:

https://lkml.org/lkml/2013/4/9/23
https://lkml.org/lkml/2013/4/10/49

Also reproduces his proposed comment.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8e251fa 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high

Because we are going to rely upon a careful transision between old and new
->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.

Add a simple mutex to protect both update loops.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4008bab7 03-Jul-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: factor out setting of pcp->high and pcp->batch

"Problems" with the current code:

1: there is a lack of synchronization in setting ->high and ->batch in
percpu_pagelist_fraction_sysctl_handler()

2: stop_machine() in zone_pcp_update() is unnecissary.

3: zone_pcp_update() does not consider the case where
percpu_pagelist_fraction is non-zero

To fix:

1: add memory barriers, a safe ->batch value, an update side mutex when
updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
that expect a stable value.

2: avoid draining pages in zone_pcp_update(), rely upon the memory
barriers added to fix #1

3: factor out quite a few functions, and then call the appropriate one.

Note that it results in a change to the behavior of zone_pcp_update(),
which is used by memory_hotplug. I'm rather certain that I've diserned
(and preserved) the essential behavior (changing ->high and ->batch), and
only eliminated unneeded actions (draining the per cpu pages), but this
may not be the case.

Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to
offline a page, and after the offline has "succeeded". It appears that
the draining was added to zone_pcp_update() to avoid refactoring
setup_pageset() into 2 funtions.

This patch:

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high based
on ->batch.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 026b0814 12-Jun-2013 Tomasz Stanislawski <t.stanislaws@samsung.com>

mm/page_alloc.c: fix watermark check in __zone_watermark_ok()

The watermark check consists of two sub-checks. The first one is:

if (free_pages <= min + lowmem_reserve)
return false;

The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.

if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

This prevents the zone from being drained from pages available for
non-movable allocations.

The second check prevents the zone from getting too fragmented.

for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}

The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.

This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.

Laura said:

We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.

For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.

Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Tested-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb3ec6b0 21-May-2013 Ralf Baechle <ralf@linux-mips.org>

mm: Fix virt_to_page() warning

virt_to_page() is typically implemented as a macro containing a cast so
that it will accept both pointers and unsigned long without causing a
warning.

But MIPS virt_to_page() uses virt_to_phys which is a function so passing
an unsigned long will cause a warning:

CC mm/page_alloc.o
mm/page_alloc.c: In function ‘free_reserved_area’:
mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’

All others users of virt_to_page() in mm/ are passing a void *.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reported-by: Eunbong Song <eunb.song@samsung.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f9872caf 29-Apr-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

page_alloc: make setup_nr_node_ids() usable for arch init code

powerpc and x86 were opencoding copies of setup_nr_node_ids(), which
page_alloc provides but makes static. Make it avaliable to the archs in
linux/mm.h.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c243c71 29-Apr-2013 Russ Anderson <rja@sgi.com>

mm: speedup in __early_pfn_to_nid

When booting on a large memory system, the kernel spends considerable
time in memmap_init_zone() setting up memory zones. Analysis shows
significant time spent in __early_pfn_to_nid().

The routine memmap_init_zone() checks each PFN to verify the nid is
valid. __early_pfn_to_nid() sequentially scans the list of pfn ranges
to find the right range and returns the nid. This does not scale well.
On a 4 TB (single rack) system there are 308 memory ranges to scan. The
higher the PFN the more time spent sequentially spinning through memory
ranges.

Since memmap_init_zone() increments pfn, it will almost always be
looking for the same range as the previous pfn, so check that range
first. If it is in the same range, return that nid. If not, scan the
list as before.

A 4 TB (single rack) UV1 system takes 512 seconds to get through the
zone code. This performance optimization reduces the time by 189
seconds, a 36% improvement.

A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
a 112.9 second (53%) reduction.

[akpm@linux-foundation.org: make the statics __meminitdata]
[akpm@linux-foundation.org: fix comment formatting]
[akpm@linux-foundation.org: fix ia64, per yinghai]
[akpm@linux-foundation.org: add missing semicolon, per Tony]
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fed2719e 29-Apr-2013 Mel Gorman <mgorman@suse.de>

mm: page_alloc: avoid marking zones full prematurely after zone_reclaim()

The following problem was reported against a distribution kernel when
zone_reclaim was enabled but the same problem applies to the mainline
kernel. The reproduction case was as follows

1. Run numactl -m +0 dd if=largefile of=/dev/null
This allocates a large number of clean pages in node 0

2. numactl -N +0 memhog 0.5*Mg
This start a memory-using application in node 0.

The expected behaviour is that the clean pages get reclaimed and the
application uses node 0 for its memory. The observed behaviour was that
the memory for the memhog application was allocated off-node since
commits cd38b115d5ad ("mm: page allocator: initialise ZLC for first zone
eligible for zone_reclaim") and commit 76d3fbf8fbf6 ("mm: page
allocator: reconsider zones for allocation after direct reclaim").

The assumption of those patches was that it was always preferable to
allocate quickly than stall for long periods of time and they were meant
to take care that the zone was only marked full when necessary but an
important case was missed.

In the allocator fast path, only the low watermarks are checked. If the
zones free pages are between the low and min watermark then allocations
from the allocators slow path will succeed. However, zone_reclaim will
only reclaim SWAP_CLUSTER_MAX or 1<<order pages. There is no guarantee
that this will meet the low watermark causing the zone to be marked full
prematurely.

This patch will only mark the zone full after zone_reclaim if it the min
watermarks are checked or if page reclaim failed to make sufficient
progress.

[mhocko@suse.cz: fix alloc_flags test]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Hedi Berriche <hedi@sgi.com>
Tested-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 949f7ec5 29-Apr-2013 David Rientjes <rientjes@google.com>

mm, hugetlb: include hugepages in meminfo

Particularly in oom conditions, it's troublesome that hugetlb memory is
not displayed. All other meminfo that is emitted will not add up to
what is expected, and there is no artifact left in the kernel log to
show that a potentially significant amount of memory is actually
allocated as hugepages which are not available to be reclaimed.

Booting with hugepages=8192 on the command line, this memory is now
shown in oom conditions. For example, with echo m >
/proc/sysrq-trigger:

Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfa11e08 29-Apr-2013 Jiang Liu <liuj97@gmail.com>

mm: introduce free_highmem_page() helper to free highmem pages into buddy system

The original goal of this patchset is to fix the bug reported by

https://bugzilla.kernel.org/show_bug.cgi?id=53501

Now it has also been expanded to reduce common code used by memory
initializion.

This is the second part, which applies to the previous part at:
http://marc.info/?l=linux-mm&m=136289696323825&w=2

It introduces a helper function free_highmem_page() to free highmem
pages into the buddy system when initializing mm subsystem.
Introduction of free_highmem_page() is one step forward to clean up
accesses and modificaitons of totalhigh_pages, totalram_pages and
zone->managed_pages etc. I hope we could remove all references to
totalhigh_pages from the arch/ subdirectory.

We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help
to test this patchset are welcomed!

There are several other parts still under development:
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.

This patch:

Introduce helper function free_highmem_page(), which will be used by
architectures with HIGHMEM enabled to free highmem pages into the buddy
system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Attilio Rao <attilio.rao@citrix.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69afade7 29-Apr-2013 Jiang Liu <liuj97@gmail.com>

mm: introduce common help functions to deal with reserved/managed pages

The original goal of this patchset is to fix the bug reported by

https://bugzilla.kernel.org/show_bug.cgi?id=53501

Now it has also been expanded to reduce common code used by memory
initializion.

This is the first part, which applies to v3.9-rc1.

It introduces following common helper functions to simplify
free_initmem() and free_initrd_mem() on different architectures:

adjust_managed_page_count():
will be used to adjust totalram_pages, totalhigh_pages,
zone->managed_pages when reserving/unresering a page.

__free_reserved_page():
free a reserved page into the buddy system without adjusting
page statistics info

free_reserved_page():
free a reserved page into the buddy system and adjust page
statistics info

mark_page_reserved():
mark a page as reserved and adjust page statistics info

free_reserved_area():
free a continous ranges of pages by calling free_reserved_page()

free_initmem_default():
default method to free __init pages.

We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help to
test this patchset are welcomed!

There are several other parts still under development:
Part2: introduce free_highmem_page() to simplify freeing highmem pages
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.

This patch:

Code to deal with reserved/managed pages are duplicated by many
architectures, so introduce common help functions to reduce duplicated
code. These common help functions will also be used to concentrate code
to modify totalram_pages and zone->managed_pages, which makes the code
much more clear.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b59e6c4 29-Apr-2013 David Rientjes <rientjes@google.com>

mm, show_mem: suppress page counts in non-blockable contexts

On large systems with a lot of memory, walking all RAM to determine page
types may take a half second or even more.

In non-blockable contexts, the page allocator will emit a page allocation
failure warning unless __GFP_NOWARN is specified. In such contexts, irqs
are typically disabled and such a lengthy delay may even result in NMI
watchdog timeouts.

To fix this, suppress the page walk in such contexts when printing the
page allocation failure warning.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5853ff23 25-Mar-2013 K. Y. Srinivasan <kys@microsoft.com>

mm: export split_page()

This symbol will be used in the Hyper-V balloon driver to support 2M
allocations.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 20e6926d 01-Mar-2013 Yinghai Lu <yinghai@kernel.org>

x86, ACPI, mm: Revert movablemem_map support

Tim found:

WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5

Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")

It turns out movable_map has some problems, and it breaks several things

1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy.
and make fall back path working.

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.

3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes critical x86 code. It caused x86 guys did not
pay attention to find the problem early. Those patches really should
be routed via tip/x86/mm.

4. after that commit, following range can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.

If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.

We have workaround patch that could fix some problems, but some can not
be fixed.

So just remove that offending commit and related ones including:

f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")

01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")

27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")

e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")

fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")

42f47e27e761 ("page_alloc: make movablemem_map have higher priority")

6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")

34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")

4d59a75125d5 ("x86: get pg_data_t's memory from other node")

Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0. Also
need to find way to put other kernel used ram to local node ram.

Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0fb5815 22-Feb-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm: accurately document nr_free_*_pages functions with code comments

nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
in case of the misuse of them.

[akpm@linux-foundation.org: tweak comments]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebec3862 22-Feb-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

mm: fix return type for functions nr_free_*_pages

Currently, the amount of RAM that functions nr_free_*_pages return is
held in unsigned int. But in machines with big memory (exceeding 16TB),
the amount may be incorrect because of overflow, so fix it.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 00ef2d2f 22-Feb-2013 David Rientjes <rientjes@google.com>

mm: use NUMA_NO_NODE

Make a sweep through mm/ and convert code that uses -1 directly to using
the more appropriate NUMA_NO_NODE.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5e6a5a2 22-Feb-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: add informative debugging message in page_outside_zone_boundaries()

Add a debug message which prints when a page is found outside of the
boundaries of the zone it should belong to. Format is:
"page $pfn outside zone [ $start_pfn - $end_pfn ]"

[akpm@linux-foundation.org: s/pr_debug/pr_err/]
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d29bb978 22-Feb-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm/page_alloc: add a VM_BUG in __free_one_page() if the zone is uninitialized.

Freeing pages to uninitialized zones is not handled by
__free_one_page(), and should never happen when the code is correct.

Ran into this while writing some code that dynamically onlines extra
zones.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 108bcc96 22-Feb-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

mm: add & use zone_end_pfn() and zone_spans_pfn()

Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.

This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.

Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the sycronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c620e2b 22-Feb-2013 Hugh Dickins <hughd@google.com>

mm: remove offlining arg to migrate_pages

No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22b751c3 22-Feb-2013 Mel Gorman <mgorman@suse.de>

mm: rename page struct field helpers

The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as
it is used throughout the arch code a rename would likely cause more
conflicts than it is worth.

[akpm@linux-foundation.org: fix zcache]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21caf2fc 22-Feb-2013 Ming Lei <ming.lei@canonical.com>

mm: teach mm by current context info to not do I/O during memory allocation

This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.

The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:

- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one of its
ancestors(or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involed page I/O finishes. The situation
is pointed out first by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved(for example, PCI, USB and SCSI may be
involved for usb mass stoarage device, network devices involved too in
the iSCSI case)

- during block device runtime suspend, because runtime resume need to
wait for completion of concurrent runtime suspend.

- during error handling of usb mass storage deivce, USB bus reset will
be put on the device, so there shouldn't have any memory allocation with
GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_set() and
.post_reset() callback at all, so USB core has to unbind and bind driver
for these devices. So it is still not practical to resort to GFP_NOIO
for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.

In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.

[1], several GFP_KERNEL allocation examples in runtime resume path

- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found. Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 194159fb 22-Feb-2013 Minchan Kim <minchan@kernel.org>

mm: remove MIGRATE_ISOLATE check in hotpath

Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
CMA, memory-hotplug and memory-failure) which are not common config
option. So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 306f2e9e 22-Feb-2013 Jiang Liu <liuj97@gmail.com>

mm: set zone->present_pages to number of existing pages in the zone

Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:

present_pages = spanned_pages - absent_pages;

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b40da049 22-Feb-2013 Jiang Liu <liuj97@gmail.com>

mm: use zone->present_pages instead of zone->managed_pages where appropriate

Now we have zone->managed_pages for "pages managed by the buddy system
in the zone", so replace zone->present_pages with zone->managed_pages if
what the user really wants is number of allocatable pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01a178a9 22-Feb-2013 Tang Chen <tangchen@cn.fujitsu.com>

acpi, memory-hotplug: support getting hotplug info from SRAT

We now provide an option for users who don't want to specify physical
memory address in kernel commandline.

/*
* For movablemem_map=acpi:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* hotpluggable: n y y n
* movablemem_map: |_____| |_________|
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*/

So user just specify movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

If all the memory ranges in SRAT is hotpluggable, then no memory can be
used by kernel. But before parsing SRAT, memblock has already reserve
some memory ranges for other purposes, such as for kernel image, and so
on. We cannot prevent kernel from using these memory. So we need to
exclude these ranges even if these memory is hotpluggable.

Furthermore, there could be several memory ranges in the single node
which the kernel resides in. We may skip one range that have memory
reserved by memblock, but if the rest of memory is too small, then the
kernel will fail to boot. So, make the whole node which the kernel
resides in un-hotpluggable. Then the kernel has enough memory to use.

NOTE: Using this way will cause NUMA performance down because the
whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
on it. If users don't want to lose NUMA performance, just don't use
it.

[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27168d38 22-Feb-2013 Tang Chen <tangchen@cn.fujitsu.com>

acpi, memory-hotplug: extend movablemem_map ranges to the end of node

When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.

Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges user didn't specify.

We now implement movablemem_map boot option like this:

/*
* For movablemem_map=nn[KMG]@ss[KMG]:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* user specified: |__| |___|
* movablemem_map: |___| |_________| |______| ......
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*
* NOTE: In this case, SRAT info will be ingored.
*/

[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42f47e27 22-Feb-2013 Tang Chen <tangchen@cn.fujitsu.com>

page_alloc: make movablemem_map have higher priority

If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6981ec31 22-Feb-2013 Tang Chen <tangchen@cn.fujitsu.com>

page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes

Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
limit from movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movable_map.map[] belongs, and calculates the low boundary of
ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 34b71f1e 22-Feb-2013 Tang Chen <tangchen@cn.fujitsu.com>

page_alloc: add movable_memmap kernel parameter

Add functions to parse movablemem_map boot option. Since the option
could be specified more then once, all the maps will be stored in the
global variable movablemem_map.map array.

And also, we keep the array in monotonic increasing order by start_pfn.
And merge all overlapped ranges.

[akpm@linux-foundation.org: improve comment]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: remove unneeded parens]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90ae8d67 22-Feb-2013 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long

`int' is an inappropriate type for a number-of-pages counter.

While we're there, use the clamp() macro.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a6f5124 22-Feb-2013 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

CMA: make putback_lru_pages() call conditional

As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.

[akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c45512d 18-Feb-2013 Linus Torvalds <torvalds@linux-foundation.org>

mm: fix pageblock bitmap allocation

Commit c060f943d092 ("mm: use aligned zone start for pfn_to_bitidx
calculation") fixed out calculation of the index into the pageblock
bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages.

However, the _allocation_ of that bitmap had never taken this alignment
requirement into accout, so depending on the exact size and alignment of
the zone, the use of that index could then access past the allocation,
resulting in some very subtle memory corruption.

This was reported (and bisected) by Ingo Molnar: one of his random
config builds would hang with certain very specific kernel command line
options.

In the meantime, commit c060f943d092 has been marked for stable, so this
fix needs to be back-ported to the stable kernels that backported the
commit to use the right alignment.

Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41a79734 12-Feb-2013 Marek Szyprowski <m.szyprowski@samsung.com>

mm: cma: fix accounting of CMA pages placed in high memory

The total number of low memory pages is determined as totalram_pages -
totalhigh_pages, so without this patch all CMA pageblocks placed in
highmem were accounted to low memory.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bd75c77 07-Feb-2013 Clark Williams <williams@redhat.com>

sched/rt: Move rt specific bits into new header file

Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 373d4d09 20-Jan-2013 Rusty Russell <rusty@rustcorp.com.au>

taint: add explicit flag to show whether lock dep is still OK.

Fix up all callers as they were before, with make one change: an
unsigned module taints the kernel, but doesn't turn off lockdep.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 8fb74b9f 11-Jan-2013 Mel Gorman <mgorman@suse.de>

mm: compaction: partially revert capture of suitable high-order page

Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket. It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed. For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped. However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach. What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way. While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested. This patch
partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
suitable high-order page immediately when it is made available").

Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c060f943 11-Jan-2013 Laura Abbott <lauraa@codeaurora.org>

mm: use aligned zone start for pfn_to_bitidx calculation

The current calculation in pfn_to_bitidx assumes that (pfn -
zone->zone_start_pfn) >> pageblock_order will return the same bit for
all pfn in a pageblock. If zone_start_pfn is not aligned to
pageblock_nr_pages, this may not always be correct.

Consider the following with pageblock order = 10, zone start 2MB:

pfn | pfn - zone start | (pfn - zone start) >> page block order
----------------------------------------------------------------
0x26000 | 0x25e00 | 0x97
0x26100 | 0x25f00 | 0x97
0x26200 | 0x26000 | 0x98
0x26300 | 0x26100 | 0x98

This means that calling {get,set}_pageblock_migratetype on a single page
will not set the migratetype for the full block. Fix this by rounding
down zone_start_pfn when doing the bitidx calculation.

For our use case, the effects of this bug were mostly tied to the fact
that CMA allocations would either take a long time or fail to happen.
Depending on the driver using CMA, this could result in anything from
visual glitches to application failures.

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 47ecfcb7 11-Jan-2013 Mel Gorman <mgorman@suse.de>

mm: compaction: Partially revert capture of suitable high-order page

Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket. It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed. For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped. However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach. What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way. While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested. This patch
partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
high-order page immediately when it is made available".

Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a458431e 04-Jan-2013 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

mm: fix zone_watermark_ok_safe() accounting of isolated pages

Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever
problem") added an isolated pageblocks counter (nr_pageblock_isolate in
struct zone) and used it to adjust free pages counter in
zone_watermark_ok_safe() to prevent kswapd looping forever problem.

Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages")
fixed accounting of isolated pages in global free pages counter. It
made the previous zone_watermark_ok_safe() fix unnecessary and
potentially harmful (cause now isolated pages may be accounted twice
making free pages counter incorrect).

This patch removes the special isolated pageblocks counter altogether
which fixes zone_watermark_ok_safe() free pages check.

Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bcc2b02f 20-Dec-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: cma: WARN if freed memory is still in use

Memory returned to free_contig_range() must have no other references.
Let kernel to complain loudly if page reference count is not equal to 1.

[rientjes@google.com: support sparsemem]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a1a0d3b 18-Dec-2012 Glauber Costa <glommer@parallels.com>

mm: allocate kernel pages to the right memcg

When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.

To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0bb2c763 18-Dec-2012 Gavin Shan <shangw@linux.vnet.ibm.com>

mm/page_alloc.c: remove duplicate check

While allocating pages using buddy allocator, the compound page is
probably split up to free pages. Under these circumstances, the compound
page should be destroyed by destroy_compound_page(). However, there is a
duplicate check to judge if the page is compound.

Remove the duplicate check since the compound_order() returns 0 when the
page doesn't have PG_head set in destroy_compound_page(). That is to say,
destroy_compound_page() needn't check PageHead().

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01cefaef 12-Dec-2012 Jiang Liu <liuj97@gmail.com>

mm: provide more accurate estimation of pages occupied by memmap

If SPARSEMEM is enabled, it won't build page structures for non-existing
pages (holes) within a zone, so provide a more accurate estimation of
pages occupied by memmap if there are bigger holes within the zone.

And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.

[akpm@linux-foundation.org: mark calc_memmap_size __paging_init]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Tested-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9feedc9d 12-Dec-2012 Jiang Liu <liuj97@gmail.com>

mm: introduce new field "managed_pages" to struct zone

Currently a zone's present_pages is calcuated as below, which is
inaccurate and may cause trouble to memory hotplug.

spanned_pages - absent_pages - memmap_pages - dma_reserve.

During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages may
have different meanings in different contexts:

1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset tries to introduce a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system". And revert
zone->present_pages to count "physical pages existing in a zone", which
also keep in consistence with pgdat->node_present_pages.

We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.

For DMA/normal zones, the initial value is set to:

(spanned_pages - absent_pages - memmap_pages - dma_reserve)

Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.

[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20b2f52b 12-Dec-2012 Lai Jiangshan <laijs@cn.fujitsu.com>

numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

We need a node which only contains movable memory. This feature is very
important for node hotplug. If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined. If the node only contains
movable memory, we can offline the memory and the node.

All are prepared, we can actually introduce N_MEMORY.
add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node

[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b0ef1fe 12-Dec-2012 Lai Jiangshan <laijs@cn.fujitsu.com>

page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.

The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.

Since we introduced N_MEMORY, we update the initialization of node_states.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be49a6e1 12-Dec-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: use migrate_prep() instead of migrate_prep_local()

__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regadless the cost of such operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.

Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc357f43 11-Dec-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: cma: remove watermark hacks

Commits 2139cbe627b8 ("cma: fix counting of isolated pages") and
d95ea5d18e69 ("cma: fix watermark checking") introduced a reliable
method of free page accounting when memory is being allocated from CMA
regions, so the workaround introduced earlier by commit 49f223a9cd96
("mm: trigger page reclaim in alloc_contig_range() to stabilise
watermarks") can be finally removed.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2e30abd1 11-Dec-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: cma: skip watermarks check for already isolated blocks in split_free_page()

Since commit 2139cbe627b8 ("cma: fix counting of isolated pages") free
pages in isolated pageblocks are not accounted to NR_FREE_PAGES counters,
so watermarks check is not required if one operates on a free page in
isolated pageblock.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5733c7d1 11-Dec-2012 Rafael Aquini <aquini@redhat.com>

mm: introduce putback_movable_pages()

The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on balloon page list as if a ballooned page was like a LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6dcd73d7 11-Dec-2012 Wen Congyang <wency@cn.fujitsu.com>

memory-hotplug: allocate zone's pcp before onlining pages

We use __free_page() to put a page to buddy system when onlining pages.
__free_page() will store NR_FREE_PAGES in zone's pcp.vm_stat_diff, so we
should allocate zone's pcp before onlining pages, otherwise we will lose
some free pages.

[mhocko@suse.cz: make zone_pcp_reset independent of MEMORY_HOTREMOVE]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97d0da22 11-Dec-2012 Wen Congyang <wency@cn.fujitsu.com>

memory-hotplug: fix NR_FREE_PAGES mismatch

NR_FREE_PAGES will be wrong after offlining pages. We add/dec
NR_FREE_PAGES like this now:

1. move all pages in buddy system to MIGRATE_ISOLATE, and dec NR_FREE_PAGES

2. don't add NR_FREE_PAGES when it is freed and the migratetype is
MIGRATE_ISOLATE

3. dec NR_FREE_PAGES when offlining isolated pages.

4. add NR_FREE_PAGES when undoing isolate pages.

When we come to step 3, all pages are in MIGRATE_ISOLATE list, and
NR_FREE_PAGES are right. When we come to step4, all pages are not in
buddy system, so we don't change NR_FREE_PAGES in this step, but we change
NR_FREE_PAGES in step3. So NR_FREE_PAGES is wrong after offlining pages.
So there is no need to change NR_FREE_PAGES in step3.

This patch also fixs a problem in step2: if the migratetype is
MIGRATE_ISOLATE, we should not add NR_FRR_PAGES when we remove pages from
pcppages.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jianguo Wu <wujianguo106@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b023f468 11-Dec-2012 Wen Congyang <wency@cn.fujitsu.com>

memory-hotplug: skip HWPoisoned page when offlining pages

hwpoisoned may be set when we offline a page by the sysfs interface
/sys/devices/system/memory/soft_offline_page or
/sys/devices/system/memory/hard_offline_page. If we don't clear
this flag when onlining pages, this page can't be freed, and will
not in free list. So we can't offline these pages again. So we
should skip such page when offlining pages.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5adfffc 11-Dec-2012 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILD

We don't need custom NUMA_BUILD anymore, since we have handy
IS_ENABLED().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 377e4f16 11-Dec-2012 Rabin Vincent <rabin.vincent@stericsson.com>

mm: show migration types in show_mem

This is useful to diagnose the reason for page allocation failure for
cases where there appear to be several free pages.

Example, with this alloc_pages(GFP_ATOMIC) failure:

swapper/0: page allocation failure: order:0, mode:0x0
...
Mem-info:
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 48
CPU 1: hi: 90, btch: 15 usd: 21
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:84 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:4026 slab_reclaimable:75 slab_unreclaimable:484
mapped:0 shmem:0 pagetables:0 bounce:0
Normal free:16104kB min:2296kB low:2868kB high:3444kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:336kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:331776kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:300kB
slab_unreclaimable:1936kB kernel_stack:328kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0

Before the patch, it's hard (for me, at least) to say why all these free
chunks weren't considered for allocation:

Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 3*4096kB = 16128kB

After the patch, it's obvious that the reason is that all of these are
in the MIGRATE_CMA (C) freelist:

Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB
(C) 1*1024kB (C) 1*2048kB (C) 3*4096kB (C) = 16128kB

Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57e0a030 12-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: numa: Introduce last_nid to the page frame

This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.

Signed-off-by: Mel Gorman <mgorman@suse.de>


# 8177a420 23-Mar-2012 Andrea Arcangeli <aarcange@redhat.com>

mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting

This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>


# 7b2a2d4a 19-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: migrate: Add a tracepoint for migrate_pages

The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>


# caf49191 10-Dec-2012 Linus Torvalds <torvalds@linux-foundation.org>

Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage

This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and
d7c3b937bdf45f0b844400b7bf6fd3ed50bac604.

This is a revert of a revert of a revert. In addition, it reverts the
even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
original commits in linux-next.

It turns out that the original patch really was bogus, and that the
original revert was the correct thing to do after all. We thought we
had fixed the problem, and then reverted the revert, but the problem
really is fundamental: waking up kswapd simply isn't the right thing to
do, and direct reclaim sometimes simply _is_ the right thing to do.

When certain allocations fail, we simply should try some direct reclaim,
and if that fails, fail the allocation. That's the right thing to do
for THP allocations, which can easily fail, and the GPU allocations want
to do that too.

So starting kswapd is sometimes simply wrong, and removing the flag that
said "don't start kswapd" was a mistake. Let's hope we never revisit
this mistake again - and certainly not this many times ;)

Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31f8d42d 10-Dec-2012 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended"

This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0.

We are going to reinstate the __GFP_NO_KSWAPD flag that has been
removed, the removal reverted, and then removed again. Making this
commit a pointless fixup for a problem that was caused by the removal of
__GFP_NO_KSWAPD flag.

The thing is, we really don't want to wake up kswapd for THP allocations
(because they fail quite commonly under any kind of memory pressure,
including when there is tons of memory free), and these patches were
just trying to fix up the underlying bug: the original removal of
__GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD")
was simply bogus.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d49e352 06-Dec-2012 Nadia Yvette Chambers <nyc@holomorphy.com>

propagate name change to comments in kernel source

I've legally changed my name with New York State, the US Social Security
Administration, et al. This patch propagates the name change and change
in initials and login to comments in the kernel source as well.

Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 782fd304 29-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: avoid waking kswapd for THP allocations when compaction is deferred or contended

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

Hmm, so it's just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless like before -
but still it easily eats minutes - it helps to turn off Firefox
or TB (memory hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)

kswapd0 R running task 0 30 2 0x00000000
Call Trace:
preempt_schedule+0x42/0x60
_raw_spin_unlock+0x55/0x60
put_super+0x31/0x40
drop_super+0x22/0x30
prune_super+0x149/0x1b0
shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be
reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
are a storm of THP requests that are simply rejected, it will still be
the the case that kswapd is awake for a prolonged period of time as
pgdat->kswapd_max_order is updated each time. This is noticed by the
main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
it will loopp, shrinking a small number of pages and calling
shrink_slab() on each iteration.

This patch defers when kswapd gets woken up for THP allocations. For
!THP allocations, kswapd is always woken up. For THP allocations,
kswapd is woken up iff the process is willing to enter into direct
reclaim/compaction.

[akpm@linux-foundation.org: fix typo in comment]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5091539 29-Nov-2012 Andrew Morton <akpm@linux-foundation.org>

revert "Revert "mm: remove __GFP_NO_KSWAPD""

It apepars that this patch was innocent, and we hope that "mm: avoid
waking kswapd for THP allocations when compaction is deferred or
contended" will fix the final kswapd-spinning cause.

Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58d00209 29-Nov-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: fix return value of capture_free_page()

Commit ef6c5be658f6 ("fix incorrect NR_FREE_PAGES accounting (appears
like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the
return value which was also missed by this reviewer until today.

That return value is used by compaction when adding pages to a list of
isolated free pages and without this follow-up fix, there is a risk of
free list corruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82b212f4 26-Nov-2012 Mel Gorman <mgorman@suse.de>

Revert "mm: remove __GFP_NO_KSWAPD"

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

Hmm, so it's just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless like before -
but still it easily eats minutes - it helps to turn off Firefox
or TB (memory hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)

kswapd0 R running task 0 30 2 0x00000000
Call Trace:
preempt_schedule+0x42/0x60
_raw_spin_unlock+0x55/0x60
put_super+0x31/0x40
drop_super+0x22/0x30
prune_super+0x149/0x1b0
shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be
reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
are a storm of THP requests that are simply rejected, it will still be
the the case that kswapd is awake for a prolonged period of time as
pgdat->kswapd_max_order is updated each time. This is noticed by the
main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
it will loopp, shrinking a small number of pages and calling
shrink_slab() on each iteration.

The temptation is to supply a patch that checks if kswapd was woken for
THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
backed up by proper testing. As 3.7 is very close to release and this
is not a bug we should release with, a safer path is to revert "mm:
remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
out the balance_pgdat() logic in general.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef6c5be6 21-Nov-2012 Dave Hansen <dave@linux.vnet.ibm.com>

fix incorrect NR_FREE_PAGES accounting (appears like memory leak)

There have been some 3.7-rc reports of vm issues, including some kswapd
bugs and, more importantly, some memory "leaks":

http://www.spinics.net/lists/linux-mm/msg46187.html
https://bugzilla.kernel.org/show_bug.cgi?id=50181

Commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available") took split_free_page() and
reused it for the compaction code. It does something curious with
capture_free_page() (previously known as split_free_page()):

int capture_free_page(struct page *page, int alloc_order,
...
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

- /* Split into individual pages */
- set_page_refcounted(page);
- split_page(page, order);
+ if (alloc_order != order)
+ expand(zone, page, alloc_order, order,
+ &zone->free_area[order], migratetype);

Note that expand() puts the pages _back_ in the allocator, but it does
not bump NR_FREE_PAGES. We "return" 'alloc_order' worth of pages, but
we accounted for removing 'order' in the __mod_zone_page_state() call.

For the old split_page()-style use (order==alloc_order) the bug will not
trigger. But, when called from the compaction code where we
occasionally get a larger page out of the buddy allocator than we need,
we will run in to this.

This patch simply changes the NR_FREE_PAGES manipulation to the correct
'alloc_order' instead of 'order'.

I've been able to repeatedly trigger this in my testing environment.
The amount "leaked" very closely tracks the imbalance I see in buddy
pages vs. NR_FREE_PAGES. I have confirmed that this patch fixes the
imbalance

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5576646f 16-Nov-2012 Andrew Morton <akpm@linux-foundation.org>

revert "mm: fix-up zone present pages"

Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages")

That patch tried to fix a issue when calculating zone->present_pages,
but it caused a regression on 32bit systems with HIGHMEM. With that
change, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
buddy allocator. Because highmem pages are not freed by bootmem
allocator, all highmem zones' present_pages becomes zero.

Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.

Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bea8c150 16-Nov-2012 Hugh Dickins <hughd@google.com>

memcg: fix hotplugged memory zone oops

When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec. (On
many configurations, calculating zone from page is slower.)

But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded. Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
IP: __mod_zone_page_state+0x9/0x60
Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
Call Trace:
__pagevec_lru_add_fn+0xdf/0x140
pagevec_lru_move_fn+0xb1/0x100
__pagevec_lru_add+0x1c/0x30
lru_add_drain_cpu+0xa3/0x130
lru_add_drain+0x2f/0x40
...

The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone. The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.

Ah, there was one exceptionr. For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too. In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.

I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.

Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b187d02 25-Oct-2012 David Rientjes <rientjes@google.com>

mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant

Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim
distance") caused zone_reclaim_mode to be set for all systems where two
nodes are within RECLAIM_DISTANCE of each other. This is the opposite
of what we actually want: zone_reclaim_mode should be set if two nodes
are sufficiently distant.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Julian Wollrath <jwollrath@web.de>
Tested-by: Julian Wollrath <jwollrath@web.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Patrik Kullman <patrik.kullman@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86a595f9 25-Oct-2012 Bob Liu <lliubbo@gmail.com>

mm/page_alloc.c:alloc_contig_range(): return early for err path

If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1f10213 23-Oct-2012 Bob Liu <lliubbo@gmail.com>

mm: cma: alloc_contig_range: return early for err path

If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>


# beb51eaa 08-Oct-2012 Minchan Kim <minchan@kernel.org>

cma: decrease cc.nr_migratepages after reclaiming pagelist

reclaim_clean_pages_from_list() reclaims clean pages before migration so
cc.nr_migratepages should be updated. Currently, there is no problem but
it can be wrong if we try to use the value in future.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e46a2879 08-Oct-2012 Minchan Kim <minchan@kernel.org>

CMA: migrate mlocked pages

Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running. If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a883813 08-Oct-2012 Minchan Kim <minchan@kernel.org>

memory-hotplug: fix zone stat mismatch

During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
causing the kernel to hang. When the system doesn't have enough free
pages, it enters reclaim but never reclaim any pages due to
too_many_isolated()==true and loops forever.

The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.

In addtion, when we offline all pages of the zone, we reset them in
zone_pcp_reset without draining so we loss some zone stat item.

Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 957f822a 08-Oct-2012 David Rientjes <rientjes@google.com>

mm, numa: reclaim from all nodes within reclaim distance

RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.

To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.

What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0c5e813 08-Oct-2012 Hugh Dickins <hughd@google.com>

mm: remove free_page_mlock

We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f1290f2 08-Oct-2012 Jianguo Wu <wujianguo@huawei.com>

mm: fix-up zone present pages

I think zone->present_pages indicates pages that buddy system can management,
it should be:

zone->present_pages = spanned pages - absent pages - bootmem pages,

but is now:
zone->present_pages = spanned pages - absent pages - memmap pages.

spanned pages: total size, including holes.
absent pages: holes.
bootmem pages: pages used in system boot, managed by bootmem allocator.
memmap pages: pages used by page structs.

This may cause zone->present_pages less than it should be. For example,
numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
present_pages should be spanned pages - absent pages, but now it also
minus memmap pages(free_area_init_core), which are actually allocated from
ZONE_MOVABLE. When offlining all memory of a zone, this will cause
zone->present_pages less than 0, because present_pages is unsigned long
type, it is actually a very large integer, it indirectly caused
zone->watermark[WMARK_MIN] becomes a large
integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a
large integer(calculate_totalreserve_pages()), and finally cause memory
allocating failure when fork process(__vm_enough_memory()).

[root@localhost ~]# dmesg
-bash: fork: Cannot allocate memory

I think the bug described in

http://marc.info/?l=linux-mm&m=134502182714186&w=2

is also caused by wrong zone present pages.

This patch intends to fix-up zone->present_pages when memory are freed to
buddy system on x86_64 and IA64 platforms.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Tested-by: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 723a0644 08-Oct-2012 Minchan Kim <minchan@kernel.org>

mm/page_alloc: refactor out __alloc_contig_migrate_alloc()

__alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
it out (move + rename as a common name) into page_isolation.c.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62997027 08-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
scanner meet), the zone is marked compact_blockskip_flush. When kswapd
goes to sleep, it will clear the cache. This is expected to be the
common case where the cache is cleared. It does not really matter if
kswapd happens to be asleep or going to sleep when the flag is set as
it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
finished being deferred then a process will clear the cache and start a
full scan. This situation happens if there are multiple high-order
allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb13ffeb 08-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: cache if a pageblock was scanned and no pages were isolated

When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
be discarded if it's at least 5 seconds since the last time the cache
was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 753341a4 08-Oct-2012 Mel Gorman <mgorman@suse.de>

revert "mm: have order > 0 compaction start off where it left"

This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages"). These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand. A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2d52fe5 08-Oct-2012 Wanpeng Li <liwanp@linux.vnet.ibm.com>

mm/memblock: cleanup early_node_map[] related comments

Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed
early_node_map[]. Clean up the comments to comply with that change.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 435b405c 08-Oct-2012 Minchan Kim <minchan@kernel.org>

memory-hotplug: fix pages missed by race rather than failing

If race between allocation and isolation in memory-hotplug offline
happens, some pages could be in MIGRATE_MOVABLE of free_list although the
pageblock's migratetype of the page is MIGRATE_ISOLATE.

The race could be detected by get_freepage_migratetype in
__test_page_isolated_in_pageblock. If it is detected, now EBUSY gets
bubbled all the way up and the hotplug operations fails.

But better idea is instead of returning and failing memory-hotremove, move
the free page to the correct list at the time it is detected. It could
enhance memory-hotremove operation success ratio although the race is
really rare.

Suggested by Mel Gorman.

[akpm@linux-foundation.org: small cleanup]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 95e34412 08-Oct-2012 Minchan Kim <minchan@kernel.org>

mm: remain migratetype in freed page

The page allocator caches the pageblock information in page->private while
it is in the PCP freelists but this is overwritten with the order of the
page when freed to the buddy allocator. This patch stores the migratetype
of the page in the page->index field so that it is available at all times
when the page remain in free_list.

This patch adds a new call site in __free_pages_ok so it might be overhead
a bit but it's for high order allocation. So I believe damage isn't hurt.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b12c4ad1 08-Oct-2012 Minchan Kim <minchan@kernel.org>

mm: page_alloc: use get_freepage_migratetype() instead of page_private()

The page allocator uses set_page_private and page_private for handling
migratetype when it frees page. Let's replace them with [set|get]
_freepage_migratetype to make it more clear.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d95ea5d1 08-Oct-2012 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

cma: fix watermark checking

* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
(from Minchan Kim).

* During watermark check decrease available free pages number by
free CMA pages number if necessary (unmovable allocations cannot
use pages from CMA areas).

Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1ce749a 08-Oct-2012 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

cma: count free CMA pages

Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).

[akpm@linux-foundation.org: use conventional migratetype naming]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2139cbe6 08-Oct-2012 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

cma: fix counting of isolated pages

Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix
it by properly decreasing/increasing NR_FREE_PAGES counter in
set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
adjustment for isolated pages from free_one_page() and split_free_page().

Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 770c8aaa 08-Oct-2012 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

mm: fix tracing in free_pcppages_bulk()

page->private gets re-used in __free_one_page() to store page order
(so trace_mm_page_pcpu_drain() may print order instead of migratetype)
thus migratetype value must be cached locally.

Fixes regression introduced in commit a7016235a61d ("mm: fix migratetype
bug which slowed swapping"). This caused incorrect data to be attached
to the mm_page_pcpu_drain trace event.

[akpm@linux-foundation.org: add comment]
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 02c6de8d 08-Oct-2012 Minchan Kim <minchan@kernel.org>

mm: cma: discard clean pages during contiguous allocation instead of migration

Drop clean cache pages instead of migration during alloc_contig_range() to
minimise allocation latency by reducing the amount of migration that is
necessary. It's useful for CMA because latency of migration is more
important than evicting the background process's working set. In
addition, as pages are reclaimed then fewer free pages for migration
targets are required so it avoids memory reclaiming to get free pages,
which is a contributory factor to increased latency.

I measured elapsed time of __alloc_contig_migrate_range() which migrates
10M in 40M movable zone in QEMU machine.

Before - 146ms, After - 7ms

[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rik van Riel <riel@redhat.com>
Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1fb3f8ca 08-Oct-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: capture a suitable high-order page immediately when it is made available

While compaction is migrating pages to free up large contiguous blocks
for allocation it races with other allocation requests that may steal
these blocks or break them up. This patch alters direct compaction to
capture a suitable free page as soon as it becomes available to reduce
this race. It uses similar logic to split_free_page() to ensure that
watermarks are still obeyed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6543459 08-Oct-2012 Rik van Riel <riel@redhat.com>

mm: remove __GFP_NO_KSWAPD

When transparent huge pages were introduced, memory compaction and swap
storms were an issue, and the kernel had to be careful to not make THP
allocations cause pageout or compaction.

Now that we have working compaction deferral, kswapd is smart enough to
invoke compaction and the quadratic behaviour around isolate_free_pages
has been fixed, it should be safe to remove __GFP_NO_KSWAPD.

[minchan@kernel.org: Comment fix]
[mgorman@suse.de: Avoid direct reclaim for deferred compaction]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ba8f2d5 17-Sep-2012 Li Haifeng <omycle@gmail.com>

mm/page_alloc: fix the page address of higher page's buddy calculation

The heuristic method for buddy has been introduced since commit
43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists"). But the page address of higher page's buddy
was wrongly calculated, which will lead page_is_buddy to fail for ever.
IOW, the heuristic method would be disabled with the wrong page address
of higher page's buddy.

Calculating the page address of higher page's buddy should be based
higher_page with the offset between index of higher page and index of
higher page's buddy.

Signed-off-by: Haifeng Li <omycle@gmail.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c67fe375 21-Aug-2012 Mel Gorman <mgorman@suse.de>

mm: compaction: Abort async compaction if locks are contended or taking too long

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straight-forward and in his own words;

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

Early in the test everything looks fine

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

and then it goes to pot

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b121186a 21-Aug-2012 Alex Shi <alex.shi@linux.alibaba.com>

mm: correct page->pfmemalloc to fix deactivate_slab regression

Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|---0.34%-- deactivate_slab
| __slab_alloc
| kmem_cache_alloc
| |

That causes our fio sync write performance to have a 40% regression.

Move the checking in get_page_from_freelist() which resolves this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8783b6e2 02-Aug-2012 Linus Torvalds <torvalds@linux-foundation.org>

mm: remove node_start_pfn checking in new WARN_ON for now

Borislav Petkov reports that the new warning added in commit
88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
triggers for him, and it is the node_start_pfn field that has already
been initialized once.

The call trace looks like this:

x86_64_start_kernel ->
x86_64_start_reservations ->
start_kernel ->
setup_arch ->
paging_init ->
zone_sizes_init ->
free_area_init_nodes ->
free_area_init_node

and (with the warning replaced by debug output), Borislav sees

On node 0 totalpages: 4193848
DMA zone: 64 pages used for memmap
DMA zone: 6 pages reserved
DMA zone: 3890 pages, LIFO batch:0
DMA32 zone: 16320 pages used for memmap
DMA32 zone: 798464 pages, LIFO batch:31
Normal zone: 52736 pages used for memmap
Normal zone: 3322368 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 4423680 <----
On node 1 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 8617984 <----
On node 2 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 12812288 <----
On node 3 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31

so remove the bogus warning for now to avoid annoying people. Minchan
Kim is looking at it.

Reported-by: Borislav Petkov <bp@amd64.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6527af5d 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: remove redundant initialization

pg_data_t is zeroed before reaching free_area_init_core(), so remove the
now unnecessary initializations.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88fdf75d 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: warn if pg_data_t isn't initialized with zero

Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
when it is allocated. Arch code and memory hotplug already initiailize
pg_data_t. So this warning should never happen. I select fields randomly
near the beginning, middle and end of pg_data_t for checking.

This patch isn't for performance but for removing initialization code
which is necessary to add whenever we adds new field to pg_data_t or zone.

Firstly, Andrew suggested clearing out of pg_data_t in MM core part but
Tejun doesn't like it because in the future, some archs can initialize
some fields in arch code and pass them into general MM part so blindly
clearing it out in mm core part would be very annoying.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5515061d 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage

If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother but
the system will keep running.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 183f6371 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: ignore mempolicies when using ALLOC_NO_WATERMARK

The reserve is proportionally distributed over all !highmem zones in the
system. So we need to allow an emergency allocation access to all zones.
In order to do that we need to break out of any mempolicy boundaries we
might have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed
to be within mempolicy boundaries. For instance IRQs do not even have a
mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfd19c5a 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used

__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page->pfmemalloc. This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).

This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.

[rientjes@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 907aed48 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: allow PF_MEMALLOC from softirq context

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process. This should be safe due to
the following reasons

Softirqs can run on multiple CPUs sure but the same task should not be
executing the same softirq code. Neither should the softirq
handler be preempted by any other softirq handler so the flags
should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
preempted by hardware interrupts so PF_MEMALLOC is inherited
by the hard IRQ. However, this is similar to a process in
reclaim being preempted by a hardirq. While PF_MEMALLOC is
set, gfp_to_alloc_flags() distinguishes between hard and
soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
flag.

If the softirq is deferred to ksoftirq then its flags may be used
instead of a normal tasks but as the softirq cannot be preempted,
the PF_MEMALLOC flag does not leak to other code by accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b37f1dd0 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: introduce __GFP_MEMALLOC to allow access to emergency reserves

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
like PF_MEMALLOC. It allows one to pass along the memalloc state in
object related allocation flags as opposed to task related flags, such as
sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
enough to identify allocations related to page reclaim.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 072bb0aa 31-Jul-2012 Mel Gorman <mgorman@suse.de>

mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.

The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.

Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.

Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.

Patch 13 is a micro-optimisation to avoid a function call in the
common case.

Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.

Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.

Patch 16 adds a statistic to track how often processes get throttled

Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.

Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.

MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09

This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 702d1a6e 31-Jul-2012 Minchan Kim <minchan@kernel.org>

memory-hotplug: fix kswapd looping forever problem

When hotplug offlining happens on zone A, it starts to mark freed page as
MIGRATE_ISOLATE type in buddy for preventing further allocation.
(MIGRATE_ISOLATE is very irony type because it's apparently on buddy but
we can't allocate them).

When the memory shortage happens during hotplug offlining, current task
starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go
sleep because current zone_watermark_ok_safe doesn't consider
MIGRATE_ISOLATE freed page count. Current task continue to reclaim in
direct reclaim path without kswapd's helping. The problem is that
zone->all_unreclaimable is set by only kswapd so that current task would
be looping forever like below.

__alloc_pages_slowpath
restart:
wake_all_kswapd
rebalance:
__alloc_pages_direct_reclaim
do_try_to_free_pages
if global_reclaim && !all_unreclaimable
return 1; /* It means we did did_some_progress */
skip __alloc_pages_may_oom
should_alloc_retry
goto rebalance;

If we apply KOSAKI's patch[1] which doesn't depends on kswapd about
setting zone->all_unreclaimable, we can solve this problem by killing some
task in direct reclaim path. But it doesn't wake up kswapd, still. It
could be a problem still if other subsystem needs GFP_ATOMIC request. So
kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE
going sleep.

This patch counts the number of MIGRATE_ISOLATE page block and
zone_watermark_ok_safe will consider it if the system has such blocks
(fortunately, it's very rare so no problem in POV overhead and kswapd is
never hotpath).

Copy/modify from Mel's quote
"
Ideal solution would be "allocating" the pageblock.
It would keep the free space accounting as it is but historically,
memory hotplug didn't allocate pages because it would be difficult to
detect if a pageblock was isolated or if part of some balloon.
Allocating just full pageblocks would work around this, However,
it would play very badly with CMA.
"

[1] http://lkml.org/lkml/2012/6/14/74

[akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
[akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2cfed075 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: fix free page check in zone_watermark_ok()

__zone_watermark_ok currently compares free_pages which is a signed type
with z->lowmem_reserve[classzone_idx] which is unsigned which might lead
to sign overflow if free_pages doesn't satisfy the given order (or it came
as negative already) and then we rely on the following order loop to fix
it (which doesn't work for order-0). Let's fix the type conversion and do
not rely on the given value of free_pages or follow up fixups.

This patch fixes it because "memory-hotplug: fix kswapd looping forever
problem" depends on this.

As benefit of this patch, it doesn't rely on the loop to exit
__zone_watermark_ok in case of high order check and make the first test
effective.(ie, if (free_pages <= min + lowmem_reserve))

Aaditya reported this problem when he test my hotplug patch.

Reported-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Tested-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee6f509c 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: factor out memory isolate functions

mm/page_alloc.c has some memory isolation functions but they are used only
when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make
it configurable by new CONFIG_MEMORY_ISOLATION so that it can reduce
binary size and we can check it simple by CONFIG_MEMORY_ISOLATION, not if
defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ed7e022 31-Jul-2012 Jiang Liu <jiang.liu@huawei.com>

mm/hotplug: mark memory hotplug code in page_alloc.c as __meminit

Mark functions used by both boot and memory hotplug as __meminit to reduce
memory footprint when memory hotplug is disabled.

Alos guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's only
used by memory hotplug code.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Wei Wang <Bessel.Wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 340175b7 31-Jul-2012 Jiang Liu <jiang.liu@huawei.com>

mm/hotplug: free zone->pageset when a zone becomes empty

When a zone becomes empty after memory offlining, free zone->pageset.
Otherwise it will cause memory leak when adding memory to the empty zone
again because build_all_zonelists() will allocate zone->pageset for an
empty zone.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Wei Wang <Bessel.Wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9adb62a5 31-Jul-2012 Jiang Liu <jiang.liu@huawei.com>

mm/hotplug: correctly setup fallback zonelists when creating new pgdat

When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:

/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);

But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:

for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);

build_zonelists(pgdat);
build_zonelist_cache(pgdat);
}

Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca57df79 31-Jul-2012 Xishi Qiu <qiuxishi@huawei.com>

mm: setup pageblock_order before it's used by sparsemem

On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
Itanium, pageblock_order is a variable with default value of 0. It's set
to the right value by set_pageblock_order() in function
free_area_init_core().

But pageblock_order may be used by sparse_init() before free_area_init_core()
is called along path:
sparse_init()
->sparse_early_usemaps_alloc_node()
->usemap_size()
->SECTION_BLOCKFLAGS_BITS
->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
NR_PAGEBLOCK_BITS)

The uninitialized pageblock_size will cause memory wasting because
usemap_size() returns a much bigger value then it's really needed.

For example, on an Itanium platform,
sparse_init() pageblock_order=0 usemap_size=24576
free_area_init_core() before pageblock_order=0, usemap_size=24576
free_area_init_core() after pageblock_order=12, usemap_size=8

That means 24K memory has been wasted for each section, so fix it by calling
set_pageblock_order() from sparse_init().

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7db8889a 31-Jul-2012 Rik van Riel <riel@redhat.com>

mm: have order > 0 compaction start off where it left

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80934513 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: clean up __count_immobile_pages()

The __count_immobile_pages() naming is rather awkward. Choose a more
clear name and add a comment.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97d255c8 31-Jul-2012 Minchan Kim <minchan@kernel.org>

mm: do not use page_count() without a page pin

d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed
this problem in vmscan.c but same problem is in __count_immobile_pages().

I copy and paste d179e84ba's contents for description.

"It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU."

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a13515c 31-Jul-2012 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again

commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds
pages to the buddy allocator again") fixed one free_pcppages_bulk()
misuse. But two another miuse still exist.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# deaf386e 31-Jul-2012 Gavin Shan <shangw@linux.vnet.ibm.com>

mm/buddy: cleanup on should_fail_alloc_page

Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.

The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a6dccba 05-Jul-2012 Rabin Vincent <rabin@rab.in>

mm: cma: don't replace lowmem pages with highmem

The filesystem layer expects pages in the block device's mapping to not
be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
currently replace lowmem pages with highmem pages, leading to crashes in
filesystem code such as the one below:

Unable to handle kernel NULL pointer dereference at virtual address 00000400
pgd = c0c98000
[00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
Internal error: Oops: 817 [#1] PREEMPT SMP ARM
CPU: 0 Not tainted (3.5.0-rc5+ #80)
PC is at __memzero+0x24/0x80
...
Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
Backtrace:
[<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
[<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
r4:c15337f0
[<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
[<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
[<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
r6:beccdcf0 r5:00074000 r4:beccdbbc
[<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)

Fix this by replacing only highmem pages with highmem.

Reported-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>


# 68e3e926 03-Jun-2012 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"

This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

That commit seems to be the cause of the mm compation list corruption
issues that Dave Jones reported. The locking (or rather, absense
there-of) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.

So revert it for now, we can re-visit this for 3.6. If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f5e86c2 29-May-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: add link from struct lruvec to struct zone

This is the first stage of struct mem_cgroup_zone removal. Further
patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89abfab1 29-May-2012 Hugh Dickins <hughd@google.com>

mm/memcg: move reclaim_stat into lruvec

With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone when
memcg is disabled but per-memcg per-zone when it's enabled.

We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats, the one
which get_scan_count() will actually use.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51300cef 29-May-2012 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: cleanups

- make pageflag_names[] const

- remove null termination of pageflag_names[]

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# acc50c11 29-May-2012 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: catch out-of-date list of page flag names

String tables with names of enum items are always prone to go out of
sync with the enums themselves. Ensure during compile time that the
name table of page flags has the same size as the page flags enum.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be9cd873 29-May-2012 Gavin Shan <shangw@linux.vnet.ibm.com>

mm/buddy: dump PG_compound_lock page flag

The array pageflag_names[] does conversion from page flags into their
corresponding names so that a meaningful representation of the
corresponding page flag can be printed. This mechanism is used while
dumping page frames. However, the array missed PG_compound_lock. So
the PG_compound_lock page flag would be printed as a digital number
instead of a meaningful string.

The patch fixes that and prints "compound_lock" for the PG_compound_lock
page flag.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ceb9ce6 29-May-2012 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks

When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
allocation takes ownership of the block may take too long. The type of
the pageblock remains unchanged so the pageblock cannot be used as a
migration target during compaction.

Fix it by:

* Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
COMPACT_SYNC) and then converting sync field in struct compact_control
to use it.

* Adding nr_pageblocks_skipped field to struct compact_control and
tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
suitable page for allocation. In this case then check how if there were
enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
COMPACT_ASYNC_UNMOVABLE mode.

* Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all
pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
sets change the whole pageblock type to MIGRATE_MOVABLE.

My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
131072 standard 4KiB pages in 'Normal' zone) is to:

- allocate 120000 pages for kernel's usage
- free every second page (60000 pages) of memory just allocated
- allocate and use 60000 pages from user space
- free remaining 60000 pages of kernel memory
(now we have fragmented memory occupied mostly by user space pages)
- try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

The results:
- with compaction disabled I get 11 successful allocations
- with compaction enabled - 14 successful allocations
- with this patch I'm able to get all 100 successful allocations

NOTE: If we can make kswapd aware of order-0 request during compaction, we
can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
(COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
following thread:

http://marc.info/?l=linux-mm&m=133552069417068&w=2

[minchan@kernel.org: minor cleanups]
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 955c1cd7 29-May-2012 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: remove pageblock_default_order()

This has always been broken: one version takes an unsigned int and the
other version takes no arguments. This bug was hidden because one
version of set_pageblock_order() was a macro which doesn't evaluate its
argument.

Simplify it all and remove pageblock_default_order() altogether.

Reported-by: rajman mekaco <rajman.mekaco@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a62e2f4f 29-May-2012 Bjorn Helgaas <bhelgaas@google.com>

mm: print physical addresses consistently with other parts of kernel

Print physical address info in a style consistent with the %pR style used
elsewhere in the kernel. For example:

-Zone PFN ranges:
+Zone ranges:
- DMA32 0x00000010 -> 0x00100000
+ DMA32 [mem 0x00010000-0xffffffff]
- Normal 0x00100000 -> 0x01080000
+ Normal [mem 0x100000000-0x107fffffff]

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31fe62b9 23-May-2012 Tim Bird <tim.bird@am.sony.com>

mm: add a low limit to alloc_large_system_hash

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58f42fd5 11-May-2012 Minchan Kim <minchan.kim@gmail.com>

cma: fix migration mode

__alloc_contig_migrate_range calls migrate_pages with wrong argument
for migrate_mode. Fix it.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>


# 49f223a9 24-Jan-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks

alloc_contig_range() performs memory allocation so it also should keep
track on keeping the correct level of memory watermarks. This commit adds
a call to *_slowpath style reclaim to grab enough pages to make sure that
the final collection of contiguous pages from freelists will not starve
the system.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# bba90710 24-Jan-2012 Marek Szyprowski <m.szyprowski@samsung.com>

mm: extract reclaim code from __alloc_pages_direct_reclaim()

This patch extracts common reclaim code from __alloc_pages_direct_reclaim()
function to separate function: __perform_reclaim() which can be later used
by alloc_contig_range().

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# cfd3da1e 25-Apr-2011 Mel Gorman <mgorman@suse.de>

mm: Serialize access to min_free_kbytes

There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.

[a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 0815f3d8 03-Apr-2012 Michal Nazarewicz <mina86@mina86.com>

mm: page_isolation: MIGRATE_CMA isolation functions added

This commit changes various functions that change pages and
pageblocks migrate type between MIGRATE_ISOLATE and
MIGRATE_MOVABLE in such a way as to allow to work with
MIGRATE_CMA migrate type.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 47118af0 29-Dec-2011 Michal Nazarewicz <mina86@mina86.com>

mm: mmzone: MIGRATE_CMA migration type added

The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.

This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).

It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory. Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.

To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 6d4a4916 11-Jan-2012 Michal Nazarewicz <mina86@mina86.com>

mm: page_alloc: change fallbacks array handling

This commit adds a row for MIGRATE_ISOLATE type to the fallbacks array
which was missing from it. It also, changes the array traversal logic
a little making MIGRATE_RESERVE an end marker. The letter change,
removes the implicit MIGRATE_UNMOVABLE from the end of each row which
was read by __rmqueue_fallback() function.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 041d3a8c 29-Dec-2011 Michal Nazarewicz <mina86@mina86.com>

mm: page_alloc: introduce alloc_contig_range()

This commit adds the alloc_contig_range() function which tries
to allocate given range of pages. It tries to migrate all
already allocated pages that fall in the range thus freeing them.
Once all pages in the range are freed they are removed from the
buddy system thus allocated for the caller to use.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>


# 5f63b720 11-Jan-2012 Michal Nazarewicz <mina86@mina86.com>

mm: page_alloc: remove trailing whitespace

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>


# 1b76b02f 11-May-2012 Hugh Dickins <hughd@google.com>

mm: raise MemFree by reverting percpu_pagelist_fraction to 0

Why is there less MemFree than there used to be? It perturbed a test,
so I've just been bisecting linux-next, and now find the offender went
upstream yesterday.

Commit 93278814d359 "mm: fix division by 0 in percpu_pagelist_fraction()"
mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
us expect it to be left unset at 0 (and it's not then used as a divisor).

MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB
Repetitive test with percpu_pagelist_fraction 8:
MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB
Same test with percpu_pagelist_fraction back to 0:
MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB

Signed-off-by: Hugh Dickins <hughd@google.com>
[ We really should fix the crazy sysctl interface too, but that's a
separate thing - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93278814 10-May-2012 Sasha Levin <levinsasha928@gmail.com>

mm: fix division by 0 in percpu_pagelist_fraction()

percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
a possible error from proc_dointvec_minmax().

If any other error is returned, it would proceed to divide by zero since
percpu_pagelist_fraction wasn't getting initialized at any point. For
example, writing 0 bytes into the proc file would trigger the issue.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 155cbfc8 08-May-2012 Kay Sievers <kay@vrfy.org>

mm: use KERN_CONT in printk() continuation lines

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 74046494 28-Mar-2012 Gilad Ben-Yossef <gilad@benyossef.com>

mm: only IPI CPUs to drain local pages if they exist

Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
an IPI requesting CPUs to drain these pages to the buddy allocator if they
actually have pages when asked to flush.

This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of
severe memory pressure that leads to OOM since in these cases multiple,
possibly concurrent, allocation requests end up in the direct reclaim code
path so when the per-cpu pages end up reclaimed on first allocation
failure for most of the proceeding allocation attempts until the memory
pressure is off (possibly via the OOM killer) there are no per-cpu pages
on most CPUs (and there can easily be hundreds of them).

This also has the side effect of shortening the average latency of direct
reclaim by 1 or more order of magnitude since waiting for all the CPUs to
ACK the IPI takes a long time.

Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the
difference between the number of direct reclaim attempts that end up in
drain_all_pages() and those were more then 1/2 of the online CPU had any
per-cpu page in them, using the vmstat counters introduced in the next
patch in the series and using proc/interrupts.

In the test sceanrio, this was seen to save around 3600 global
IPIs after trigerring an OOM on a concurrent workload:

$ cat /proc/vmstat | tail -n 2
pcp_global_drain 0
pcp_global_ipi_saved 0

$ cat /proc/interrupts | grep CAL
CAL: 1 2 1 2
2 2 2 2 Function call interrupts

$ hackbench 400
[OOM messages snipped]

$ cat /proc/vmstat | tail -n 2
pcp_global_drain 3647
pcp_global_ipi_saved 3642

$ cat /proc/interrupts | grep CAL
CAL: 6 13 6 3
3 3 1 2 7 Function call interrupts

Please note that if the global drain is removed from the direct reclaim
path as a patch from Mel Gorman currently suggests this should be replaced
with an on_each_cpu_cond invocation.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29fd66d2 28-Mar-2012 David Rientjes <rientjes@google.com>

mm, coredump: fail allocations when coredumping instead of oom killing

The size of coredump files is limited by RLIMIT_CORE, however, allocating
large amounts of memory results in three negative consequences:

- the coredumping process may be chosen for oom kill and quickly deplete
all memory reserves in oom conditions preventing further progress from
being made or tasks from exiting,

- the coredumping process may cause other processes to be oom killed
without fault of their own as the result of a SIGSEGV, for example, in
the coredumping process, or

- the coredumping process may result in a livelock while writing to the
dump file if it needs memory to allocate while other threads are in
the exit path waiting on the coredumper to complete.

This is fixed by implying __GFP_NORETRY in the page allocator for
coredumping processes when reclaim has failed so the allocations fail and
the process continues to exit.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b224ef85 21-Mar-2012 Kautuk Consul <consul.kautuk@gmail.com>

page_alloc: remove unused find_zone_movable_pfns_for_nodes() argument

find_zone_movable_pfns_for_nodes() does not use its argument.

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d13bddd 21-Mar-2012 Kautuk Consul <consul.kautuk@gmail.com>

page_alloc.c: remove add_from_early_node_map()

add_from_early_node_map() is unused.

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc9a6c87 21-Mar-2012 Mel Gorman <mgorman@suse.de>

cpuset: mm: reduce large amounts of memory barrier related damage v3

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were

3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0cb3c76 21-Mar-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug

This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
("mm: fix Committed_AS underflow on large NR_CPUS environment")

The visible effect of this accident: some pages are borrowed in per-cpu
page-vectors. Truncate can deal with it, but these pages cannot be
reused while this cpu is offline. So this is like a temporary memory
leak.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 08ab9b10 21-Mar-2012 David Rientjes <rientjes@google.com>

mm, oom: force oom kill on sysrq+f

The oom killer chooses not to kill a thread if:

- an eligible thread has already been oom killed and has yet to exit,
and

- an eligible thread is exiting but has yet to free all its memory and
is not the thread attempting to currently allocate memory.

SysRq+F manually invokes the global oom killer to kill a memory-hogging
task. This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.

For both uses, we always want to kill a thread and never defer. This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aff62249 21-Mar-2012 Rik van Riel <riel@redhat.com>

vmscan: only defer compaction for failed order and higher

Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone. Even if
we only need compaction for an order 2 allocation, eg. for jumbo frames
networking.

The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 074b8517 08-Feb-2012 Dimitri Sivanich <sivanich@sgi.com>

vfs: fix panic in __d_lookup() with high dentry hashtable counts

When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 656a0706 20-Jan-2012 Michal Hocko <mhocko@suse.cz>

mm: __count_immobile_pages(): make sure the node is online

page_zone() requires an online node otherwise we are accessing NULL
NODE_DATA. This is not an issue at the moment because node_zones are
located at the structure beginning but this might change in the future
so better be careful about that.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 687875fb 20-Jan-2012 Michal Hocko <mhocko@suse.cz>

mm: fix NULL ptr dereference in __count_immobile_pages

Fix the following NULL ptr dereference caused by

cat /sys/devices/system/memory/memory0/removable

Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
is_pageblock_removable_nolock+0x34/0x40
is_mem_section_removable+0x74/0xf0
show_mem_removable+0x41/0x70
sysfs_read_file+0xfe/0x1c0
vfs_read+0xc7/0x130
sys_read+0x53/0xa0
system_call_fastpath+0x16/0x1b

We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

and early_node_map confirms that:
early_node_map[3] active PFN ranges
1: 0x00000010 -> 0x0000009c
1: 0x00000100 -> 0x000bffa3
1: 0x00100000 -> 0x00240000

The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well. This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.

When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.

We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.

Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable. This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4111304d 12-Jan-2012 Hugh Dickins <hughd@google.com>

mm: enum lru_list lru

Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 66199712 12-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred

If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has a
reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.

Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0048b0e 12-Jan-2012 Bob Liu <lliubbo@gmail.com>

page_alloc: break early in check_for_regular_memory()

If there is a zone below ZONE_NORMAL has present_pages, we can set node
state to N_NORMAL_MEMORY, no need to loop to end.

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6290df54 12-Jan-2012 Johannes Weiner <jweiner@redhat.com>

mm: collect LRU list heads into struct lruvec

Having a unified structure with a LRU list set for both global zones and
per-memcg zones allows to keep that code simple which deals with LRU
lists and does not care about the container itself.

Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3993076 10-Jan-2012 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: generalize order handling in __free_pages_bootmem()

__free_pages_bootmem() used to special-case higher-order frees to save
individual page checking with free_pages_bulk().

Nowadays, both zero order and non-zero order frees use free_pages(), which
checks each individual page anyway, and so there is little point in making
the distinction anymore. The higher-order loop will work just fine for
zero order pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df0a6daa 10-Jan-2012 Michal Hocko <mhocko@suse.cz>

mm: fix off-by-two in __zone_watermark_ok()

Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
threshold when memory is low") changed the form how free_pages is
calculated but it forgot that we used to do free_pages - ((1 << order) -
1) so we ended up with off-by-two when calculating free_pages.

Reported-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a756cf59 10-Jan-2012 Johannes Weiner <jweiner@redhat.com>

mm: try to distribute dirty pages fairly across zones

The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones. And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list. This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.

Enter per-zone dirty limits. They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place. As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon. The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation. Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.

Test results

15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

seconds nr_vmscan_write
(stddev) min| median| max
xfs
vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
patched: 550.996( 3.802) 0.000| 0.000| 0.000

fuse-ntfs
vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
patched: 558.049(17.914) 0.000| 0.000| 43.000

btrfs
vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
patched: 563.365(11.368) 0.000| 0.000| 1362.000

ext4
vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
patched: 568.806(17.496) 0.000| 0.000| 0.000

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab8fabd4 10-Jan-2012 Johannes Weiner <jweiner@redhat.com>

mm: exclude reserved pages from dirtyable memory

Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.

This patch:

The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:

+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+

This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.

[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0a32fc5 10-Jan-2012 Stanislaw Gruszka <sgruszka@redhat.com>

mm: more intensive memory corruption debugging

With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
on access (read,write) to an unallocated page, which permits us to catch
code which corrupts memory. However the kernel is trying to maximise
memory usage, hence there are usually few free pages in the system and
buggy code usually corrupts some crucial data.

This patch changes the buddy allocator to keep more free/protected pages
and to interlace free/protected and allocated pages to increase the
probability of catching corruption.

When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
debug_guardpage_minorder defines the minimum order used by the page
allocator to grant a request. The requested size will be returned with
the remaining pages used as guard pages.

The default value of debug_guardpage_minorder is zero: no change from
current behaviour.

[akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f90ac398 10-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: avoid livelock on !__GFP_FS allocations

Colin Cross reported;

Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER

These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.

The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.

In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.

The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.

This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this may
cause suspend to abort but that is better than a livelock.

[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 938929f1 10-Jan-2012 Mel Gorman <mgorman@suse.de>

mm: reduce the amount of work done when updating min_free_kbytes

When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.

The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b413d48a 10-Jan-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm-tracepoint: rename page-free events

Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
mm_page_free_batched

Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
all freed pages, not only for directly freed. So, let's name it properly.
For pages freed via page-list we also trigger mm_page_free_batched event.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da066ad3 10-Jan-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: remove unused pagevec_free

It not exported and now nobody uses it.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc59850e 10-Jan-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: add free_hot_cold_page_list() helper

This patch adds helper free_hot_cold_page_list() to free list of 0-order
pages. It frees pages directly from list without temporary page-vector.
It also calls trace_mm_pagevec_free() to simulate pagevec_free()
behaviour.

bloat-o-meter:

add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
function old new delta
free_hot_cold_page_list - 264 +264
get_page_from_freelist 2129 2132 +3
__pagevec_free 243 239 -4
split_free_page 380 373 -7
release_pages 606 510 -96
free_page_list 188 - -188

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4ae40a6 24-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch debugfs to umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d0215638 08-Dec-2011 Michal Hocko <mhocko@suse.cz>

mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks

setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] SMP
Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
EIP is at setup_zone_migrate_reserve+0xcd/0x180
EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
Call Trace:
[<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
[<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
[<c08a771c>] init_per_zone_wmark_min+0x27/0x86
[<c020111b>] do_one_initcall+0x2b/0x160
[<c086639d>] kernel_init+0xbe/0x157
[<c05cae26>] kernel_thread_helper+0x6/0xd
Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
CR2: 00000000000012b4

We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.

The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").

Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58a84aa9 08-Dec-2011 Youquan Song <youquan.song@intel.com>

thp: set compound tail page _count to zero

Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.

kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
gup_pud_range+0xb8/0x19d
get_user_pages_fast+0xcb/0x192
? trace_hardirqs_off+0xd/0xf
hva_to_pfn+0x119/0x2f2
gfn_to_pfn_memslot+0x2c/0x2e
kvm_iommu_map_pages+0xfd/0x1c1
kvm_iommu_map_memslots+0x7c/0xbd
kvm_iommu_map_guest+0xaa/0xbf
kvm_vm_ioctl_assigned_device+0x2ef/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP gup_huge_pud+0xf2/0x159

Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ee332c1 08-Dec-2011 Tejun Heo <tj@kernel.org>

memblock: Kill early_node_map[]

Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP -
there's no user of early_node_map[] left. Kill early_node_map[] and
replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
as page_alloc.c would no longer host an alternative implementation.

This change is ultimately one to one mapping and shouldn't cause any
observable difference; however, after the recent changes, there are
some functions which now would fit memblock.c better than page_alloc.c
and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
doesn't make much sense on some of them. Further cleanups for
functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

-v2: Fix compile bug introduced by mis-spelling
CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
mmzone.h. Reported by Stephen Rothwell.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>


# 6416b9fa 17-Nov-2011 Wang Sheng-Hui <shhuiw@gmail.com>

mm: cleanup the comment for head/tail pages of compound pages in mm/page_alloc.c

Only tail pages point at the head page using their ->first_page fields.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 3ee9a4f0 31-Oct-2011 Joe Perches <joe@perches.com>

mm: neaten warn_alloc_failed

Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.

[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f31888c 31-Oct-2011 Dave Jones <davej@redhat.com>

mm: output a list of loaded modules when we hit bad_page()

When we get a bad_page bug report, it's useful to see what modules the
user had loaded.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd48c085 03-Aug-2011 Akinobu Mita <akinobu.mita@gmail.com>

fault-injection: add ability to export fault_attr in arbitrary directory

init_fault_attr_dentries() is used to export fault_attr via debugfs.
But it can only export it in debugfs root directory.

Per Forlin is working on mmc_fail_request which adds support to inject
data errors after a completed host transfer in MMC subsystem.

The fault_attr for mmc_fail_request should be defined per mmc host and
export it in debugfs directory per mmc host like
/sys/kernel/debug/mmc0/mmc_fail_request.

init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
introduces fault_create_debugfs_attr() which is able to create a
directory in the arbitrary directory and replace
init_fault_attr_dentries().

[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Tested-by: Per Forlin <per.forlin@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2588c4b 26-Jul-2011 Akinobu Mita <akinobu.mita@gmail.com>

fail_page_alloc: simplify debugfs initialization

Now cleanup_fault_attr_dentries() recursively removes a directory, So we
can simplify the error handling in the initialization code and no need
to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f5ddcc8 26-Jul-2011 Akinobu Mita <akinobu.mita@gmail.com>

fault-injection: use debugfs_remove_recursive

Use debugfs_remove_recursive() to simplify initialization and
deinitialization of fault injection debugfs files.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76d3fbf8 25-Jul-2011 Mel Gorman <mgorman@suse.de>

mm: page allocator: reconsider zones for allocation after direct reclaim

With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future. If the
process enters direct reclaim, the ZLC may still consider zones to be full
even after reclaiming pages. Reconsider all zones for allocation if
direct reclaim returns successfully.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd38b115 25-Jul-2011 Mel Gorman <mgorman@suse.de>

mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim

There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines reported on a
distribution bugzilla. In these cases, zone_reclaim was enabled by
default due to large NUMA distances. In general, the complaints have not
been about the workload itself unless it was a file server (in which case
the recommendation was disable zone_reclaim).

The stalls are mostly due to significant amounts of time spent scanning
the preferred zone for pages to free. After a failure, it might fallback
to another node (as zonelists are often node-ordered rather than
zone-ordered) but stall quickly again when the next allocation attempt
occurs. In bad cases, each page allocated results in a full scan of the
preferred zone.

Patch 1 checks the preferred zone for recent allocation failure
which is particularly important if zone_reclaim has failed
recently. This avoids rescanning the zone in the near future and
instead falling back to another node. This may hurt node locality
in some cases but a failure to zone_reclaim is more expensive than
a remote access.

Patch 2 clears the zlc information after direct reclaim.
Otherwise, zone_reclaim can mark zones full, direct reclaim can
reclaim enough pages but the zone is still not considered for
allocation.

This was tested on a 24-thread 2-node x86_64 machine. The tests were
focused on large amounts of IO. All tests were bound to the CPUs on
node-0 to avoid disturbances due to processes being scheduled on different
nodes. The kernels tested are

3.0-rc6-vanilla Vanilla 3.0-rc6
zlcfirst Patch 1 applied
zlcreconsider Patches 1+2 applied

FS-Mark
./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
fsmark-3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirs zlcreconsider
Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 501.49 493.91 499.93
Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

MMTests Statistics: vmstat
Page Ins 46268 63840 66008
Page Outs 90821596 90671128 88043732
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 13091697 8966863 8971790
Kswapd pages scanned 0 1830011 1831116
Kswapd pages reclaimed 0 1829068 1829930
Direct pages reclaimed 13037777 8956828 8648314
Kswapd efficiency 100% 99% 99%
Kswapd velocity 0.000 810.643 826.346
Direct efficiency 99% 99% 96%
Direct velocity 5340.128 3972.068 4048.788
Percentage direct scans 100% 83% 83%
Page writes by reclaim 0 3 0
Slabs scanned 796672 720640 720256
Direct inode steals 7422667 7160012 7088638
Kswapd inode steals 0 1736840 2021238

Test completes far faster with a large increase in the number of files
created per second. Standard deviation is high as a small number of
iterations were much higher than the mean. The number of pages scanned by
zone_reclaim is reduced and kswapd is used for more work.

LARGE DD
3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirst zlcreconsider
download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 125.03 118.98 122.01
Total Elapsed Time (seconds) 624.56 375.02 398.06

MMTests Statistics: vmstat
Page Ins 3594216 439368 407032
Page Outs 23380832 23380488 23377444
Swap Ins 0 0 0
Swap Outs 0 436 287
Direct pages scanned 17482342 69315973 82864918
Kswapd pages scanned 0 519123 575425
Kswapd pages reclaimed 0 466501 522487
Direct pages reclaimed 5858054 2732949 2712547
Kswapd efficiency 100% 89% 90%
Kswapd velocity 0.000 1384.254 1445.574
Direct efficiency 33% 3% 3%
Direct velocity 27991.453 184832.737 208171.929
Percentage direct scans 100% 99% 99%
Page writes by reclaim 0 5082 13917
Slabs scanned 17280 29952 35328
Direct inode steals 115257 1431122 332201
Kswapd inode steals 0 0 979532

This test downloads a large tarfile and copies it with dd a number of
times - similar to the most recent bug report I've dealt with. Time to
completion is reduced. The number of pages scanned directly is still
disturbingly high with a low efficiency but this is likely due to the
number of dirty pages encountered. The figures could probably be improved
with more work around how kswapd is used and how dirty pages are handled
but that is separate work and this result is significant on its own.

Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 124.47 111.67 112.64
Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

MMTests Statistics: vmstat
Page Ins 90760 89124 89516
Page Outs 121028340 120199524 120736696
Swap Ins 0 86 55
Swap Outs 0 0 0
Direct pages scanned 114989363 96461439 96330619
Kswapd pages scanned 56430948 56965763 57075875
Kswapd pages reclaimed 27743219 27752044 27766606
Direct pages reclaimed 49777 46884 36655
Kswapd efficiency 49% 48% 48%
Kswapd velocity 26392.541 31363.631 30561.736
Direct efficiency 0% 0% 0%
Direct velocity 53780.091 53108.759 51581.004
Percentage direct scans 67% 62% 62%
Page writes by reclaim 385 122 1513
Slabs scanned 43008 39040 42112
Direct inode steals 0 10 8
Kswapd inode steals 733 534 477

This test just creates a large file mapping and writes to it linearly.
Time to completion is again reduced.

The gains are mostly down to two things. In many cases, there is less
scanning as zone_reclaim simply gives up faster due to recent failures.
The second reason is that memory is used more efficiently. Instead of
scanning the preferred zone every time, the allocator falls back to
another zone and uses it instead improving overall memory utilisation.

This patch: initialise ZLC for first zone eligible for zone_reclaim.

The zonelist cache (ZLC) is used among other things to record if
zone_reclaim() failed for a particular zone recently. The intention is to
avoid a high cost scanning extremely long zonelists or scanning within the
zone uselessly.

Currently the zonelist cache is setup only after the first zone has been
considered and zone_reclaim() has been called. The objective was to avoid
a costly setup but zone_reclaim is itself quite expensive. If it is
failing regularly such as the first eligible zone having mostly mapped
pages, the cost in scanning and allocation stalls is far higher than the
ZLC initialisation step.

This patch initialises ZLC before the first eligible zone calls
zone_reclaim(). Once initialised, it is checked whether the zone failed
zone_reclaim recently. If it has, the zone is skipped. As the first zone
is now being checked, additional care has to be taken about zones marked
full. A zone can be marked "full" because it should not have enough
unmapped pages for zone_reclaim but this is excessive as direct reclaim or
kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
zone_reclaim fails if it failed to reclaim enough pages after scanning.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7c0caeb8 14-Jul-2011 Tejun Heo <tj@kernel.org>

memblock: Add optional region->nid

From 83103b92f3234ec830852bbc5c45911bd6cbdb20 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 14 Jul 2011 11:22:16 +0200

Add optional region->nid which can be enabled by arch using
CONFIG_HAVE_MEMBLOCK_NODE_MAP. When enabled, memblock also carries
NUMA node information and replaces early_node_map[].

Newly added memblocks have MAX_NUMNODES as nid. Arch can then call
memblock_set_node() to set node information. memblock takes care of
merging and node affine allocations w.r.t. node information.

When MEMBLOCK_NODE_MAP is enabled, early_node_map[], related data
structures and functions to manipulate and iterate it are disabled.
memblock version of __next_mem_pfn_range() is provided such that
for_each_mem_pfn_range() behaves the same and its users don't have to
be updated.

-v2: Yinghai spotted section mismatch caused by missing
__init_memblock in memblock_set_node(). Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714094342.GF3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# eb40c4c2 12-Jul-2011 Tejun Heo <tj@kernel.org>

memblock, x86: Replace memblock_x86_find_in_range_node() with generic memblock calls

With the previous changes, generic NUMA aware memblock API has feature
parity with memblock_x86_find_in_range_node(). There currently are
two users - x86 setup_node_data() and __alloc_memory_core_early() in
nobootmem.c.

This patch converts the former to use memblock_alloc_nid() and the
latter memblock_find_range_in_node(), and kills
memblock_x86_find_in_range_node() and related functions including
find_memory_early_core_early() in page_alloc.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# c13291a5 12-Jul-2011 Tejun Heo <tj@kernel.org>

bootmem: Use for_each_mem_pfn_range() in page_alloc.c

The previous patch added for_each_mem_pfn_range() which is more
versatile than for_each_active_range_index_in_nid(). This patch
replaces for_each_active_range_index_in_nid() and open coded
early_node_map[] walks with for_each_mem_pfn_range().

All conversions in this patch are straight-forward and shouldn't cause
any functional difference. After the conversions,
for_each_active_range_index_in_nid() doesn't have any user left and is
removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-4-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 96e907d1 12-Jul-2011 Tejun Heo <tj@kernel.org>

bootmem: Reimplement __absent_pages_in_range() using for_each_mem_pfn_range()

__absent_pages_in_range() was needlessly complex. Reimplement it
using for_each_mem_pfn_range().

Also, update zone_absent_pages_in_node() such that it doesn't call
__absent_pages_in_range() with @zone_start_pfn which is larger than
@zone_end_pfn.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-3-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 5dfe8660 14-Jul-2011 Tejun Heo <tj@kernel.org>

bootmem: Replace work_with_active_regions() with for_each_mem_pfn_range()

Callback based iteration is cumbersome and much less useful than
for_each_*() iterator. This patch implements for_each_mem_pfn_range()
which replaces work_with_active_regions(). All the current users of
work_with_active_regions() are converted.

This simplifies walking over early_node_map and will allow converting
internal logics in page_alloc to use iterator instead of walking
early_node_map directly, which in turn will enable moving node
information to memblock.

powerpc change is only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 1f5026a7 12-Jul-2011 Tejun Heo <tj@kernel.org>

memblock: Kill MEMBLOCK_ERROR

25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made
MEMBLOCK_ERROR 0 and there already are codes which expect error return
to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its
misery.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 53348f27 12-Jul-2011 Tejun Heo <tj@kernel.org>

bootmem: Fix __free_pages_bootmem() to use @order properly

a226f6c899 (FRV: Clean up bootmem allocator's page freeing algorithm)
separated out __free_pages_bootmem() from free_all_bootmem_core().
__free_pages_bootmem() takes @order argument but it assumes @order is
either 0 or ilog2(BITS_PER_LONG). Note that all the current users
match that assumption and this doesn't cause actual problems.

Fix it by using 1 << order instead of BITS_PER_LONG.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310457490-3356-3-git-send-email-tj@kernel.org
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 1e01979c 12-Jul-2011 Tejun Heo <tj@kernel.org>

x86, numa: Implement pfn -> nid mapping granularity check

SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
sections array to map pfn to nid which is limited in granularity. If
NUMA nodes are laid out such that the mapping cannot be accurate, boot
will fail triggering BUG_ON() in mminit_verify_page_links().

On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been
granular enough until commit 2706a0bf7b (x86, NUMA: Enable
CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which
aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This
led to the following BUG_ON().

On node 0 totalpages: 2096615
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 3927 pages, LIFO batch:0
Normal zone: 1740 pages used for memmap
Normal zone: 220978 pages, LIFO batch:31
HighMem zone: 16405 pages used for memmap
HighMem zone: 1853533 pages, LIFO batch:31
BUG: Int 6: CR2 (null)
EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
err (null) EIP c16209aa CS 00000060 flg 00010002
Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
(null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
Call Trace:
[<c136b1e5>] ? early_fault+0x2e/0x2e
[<c16209aa>] ? mminit_verify_page_links+0x12/0x42
[<c1620613>] ? memmap_init_zone+0xaf/0x10c
[<c1620929>] ? free_area_init_node+0x2b9/0x2e3
[<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
[<c1601d80>] ? paging_init+0x112/0x118
[<c15f578d>] ? setup_arch+0x791/0x82f
[<c15f43d9>] ? start_kernel+0x6a/0x257

This patch implements node_map_pfn_alignment() which determines
maximum internode alignment and update numa_register_memblks() to
reject NUMA configuration if alignment exceeds the pfn -> nid mapping
granularity of the memory model as determined by PAGES_PER_SECTION.

This makes the problematic machine boot w/ flatmem by rejecting the
NUMA config and provides protection against crazy NUMA configurations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 1fa7b6a2 01-Jun-2011 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm: fail GFP_DMA allocations when ZONE_DMA is not configured"

This reverts commit a197b59ae6e8bee56fcef37ea2482dc08414e2ac.

As rmk says:
"Commit a197b59ae6e8 (mm: fail GFP_DMA allocations when ZONE_DMA is not
configured) is causing regressions on ARM with various drivers which
use GFP_DMA.

The behaviour up until now has been to silently ignore that flag when
CONFIG_ZONE_DMA is not enabled, and to allocate from the normal zone.
However, as a result of the above commit, such allocations now fail
which causes drivers to fail. These are regressions compared to the
previous kernel version."

so just revert it.

Requested-by: Russell King <linux@arm.linux.org.uk>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 246e87a9 26-May-2011 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: fix get_scan_count() for small targets

During memory reclaim we determine the number of pages to be scanned per
zone as

(anon + file) >> priority.
Assume
scan = (anon + file) >> priority.

If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
priority gets higher. This has some problems.

1. This increases priority as 1 without any scan.
To do scan in this priority, amount of pages should be larger than 512M.
If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
batched, later. (But we lose 1 priority.)
If memory size is below 16M, pages >> priority is 0 and no scan in
DEF_PRIORITY forever.

2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
So, x86's ZONE_DMA will never be recoverred until the user of pages
frees memory by itself.

3. With memcg, the limit of memory can be small. When using small memcg,
it gets priority < DEF_PRIORITY-2 very easily and need to call
wait_iff_congested().
For doing scan before priorty=9, 64MB of memory should be used.

Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when

1. the target is enough small.
2. it's kswapd or memcg reclaim.

Then we can avoid rapid priority drop and may be able to recover
all_unreclaimable in a small zones. And this patch removes nr_saved_scan.
This will allow scanning in this priority even when pages >> priority is
very small.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfa54a0f 24-May-2011 Andrew Barry <abarry@cray.com>

mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath()

I believe I found a problem in __alloc_pages_slowpath, which allows a
process to get stuck endlessly looping, even when lots of memory is
available.

Running an I/O and memory intensive stress-test I see a 0-order page
allocation with __GFP_IO and __GFP_WAIT, running on a system with very
little free memory. Right about the same time that the stress-test gets
killed by the OOM-killer, the utility trying to allocate memory gets stuck
in __alloc_pages_slowpath even though most of the systems memory was freed
by the oom-kill of the stress-test.

The utility ends up looping from the rebalance label down through the
wait_iff_congested continiously. Because order=0,
__alloc_pages_direct_compact skips the call to get_page_from_freelist.
Because all of the reclaimable memory on the system has already been
reclaimed, __alloc_pages_direct_reclaim skips the call to
get_page_from_freelist. Since there is no __GFP_FS flag, the block with
__alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested,
then jumps back to rebalance without ever trying to
get_page_from_freelist. This loop repeats infinitely.

The test case is pretty pathological. Running a mix of I/O stress-tests
that do a lot of fork() and consume all of the system memory, I can pretty
reliably hit this on 600 nodes, in about 12 hours. 32GB/node.

Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a197b59a 24-May-2011 David Rientjes <rientjes@google.com>

mm: fail GFP_DMA allocations when ZONE_DMA is not configured

The page allocator will improperly return a page from ZONE_NORMAL even
when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller
expects DMA memory, perhaps for ISA devices with 16-bit address registers,
and may get higher memory resulting in undefined behavior.

This patch causes the page allocator to return NULL in such circumstances
with a warning emitted to the kernel log on the first occurrence.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d3163ce 24-May-2011 Arve Hjønnevåg <arve@android.com>

mm: check if any page in a pageblock is reserved before marking it MIGRATE_RESERVE

This fixes a problem where the first pageblock got marked MIGRATE_RESERVE
even though it only had a few free pages. eg, On current ARM port, The
kernel starts at offset 0x8000 to leave room for boot parameters, and the
memory is freed later.

This in turn caused no contiguous memory to be reserved and frequent
kswapd wakeups that emptied the caches to get more contiguous memory.

Unfortunatelly, ARM needs order-2 allocation for pgd (see
arm/mm/pgd.c#pgd_alloc()). Therefore the issue is not minor nor easy
avoidable.

[kosaki.motohiro@jp.fujitsu.com: added some explanation]
[kosaki.motohiro@jp.fujitsu.com: add !pfn_valid_within() to check]
[minchan.kim@gmail.com: check end_pfn in pageblock_is_reserved]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a238ab5b 24-May-2011 Dave Hansen <dave@linux.vnet.ibm.com>

mm: break out page allocation warning code

This originally started as a simple patch to give vmalloc() some more
verbose output on failure on top of the plain page allocator messages.
Johannes suggested that it might be nicer to lead with the vmalloc() info
_before_ the page allocator messages.

But, I do think there's a lot of value in what __alloc_pages_slowpath()
does with its filtering and so forth.

This patch creates a new function which other allocators can call instead
of relying on the internal page allocator warnings. It also gives this
function private rate-limiting which separates it from other
printk_ratelimit() users.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6a140bf 24-May-2011 Andrea Arcangeli <aarcange@redhat.com>

mm/compaction: reverse the change that forbade sync migraton with __GFP_NO_KSWAPD

It's uncertain this has been beneficial, so it's safer to undo it. All
other compaction users would still go in synchronous mode if a first
attempt at async compaction failed. Hopefully we don't need to force
special behavior for THP (which is the only __GFP_NO_KSWAPD user so far
and it's the easier to exercise and to be noticeable). This also make
__GFP_NO_KSWAPD return to its original strict semantics specific to bypass
kswapd, as THP allocations have khugepaged for the async THP
allocations/compactions.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alex Villacis Lasso <avillaci@fiec.espol.edu.ec>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6cccdc3 24-May-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur

Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
doesn't. There is no reason for this.

[akpm@linux-foundation.org: fix CONFIG_SMP=n build]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b79acc9 24-May-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occurs

Currently, memory hotplug calls setup_per_zone_wmarks() and
calculate_zone_inactive_ratio(), but doesn't call
setup_per_zone_lowmem_reserve().

It means the number of reserved pages aren't updated even if memory hot
plug occur. This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 839a4fcc 24-May-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() should be __meminit.

Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of
zone when hotplug happens") introduced invalid section references. Now,
setup_per_zone_inactive_ratio() is marked __init and then it can't be
referenced from memory hotplug code.

This patch marks it as __meminit and also marks caller as __ref.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac3bbec5 24-May-2011 Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

mm: remove unused zone_idx variable from set_migratetype_isolate

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7bf02ea2 24-May-2011 David Rientjes <rientjes@google.com>

arch, mm: filter disallowed nodes from arch specific show_mem functions

Architectures that implement their own show_mem() function did not pass
the filter argument to show_free_areas() to appropriately avoid emitting
the state of nodes that are disallowed in the current context. This patch
now passes the filter argument to show_free_areas() so those nodes are now
avoided.

This patch also removes the show_free_areas() wrapper around
__show_free_areas() and converts existing callers to pass an empty filter.

ia64 emits additional information for each node, so skip_free_areas_zone()
must be made global to filter disallowed nodes and it is converted to use
a nid argument rather than a zone for this use case.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 268bb0ce 20-May-2011 Linus Torvalds <torvalds@linux-foundation.org>

sanitize <linux/prefetch.h> usage

Commit e66eed651fd1 ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5e6ab58 16-May-2011 Randy Dunlap <randy.dunlap@oracle.com>

mm: fix kernel-doc warning in page_alloc.c

Fix new kernel-doc warning in mm/page_alloc.c:

Warning(mm/page_alloc.c:2370): No description found for parameter 'nid'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee85c2e1 11-May-2011 Andi Kleen <ak@linux.intel.com>

mm: add alloc_pages_exact_nid()

Add a alloc_pages_exact_nid() that allocates on a specific node.

The naming is quite broken, but fixing that would need a larger renaming
action.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f389a99 11-May-2011 Yinghai Lu <yinghai@kernel.org>

mm: use alloc_bootmem_node_nopanic() on really needed path

Stefan found nobootmem does not work on his system that has only 8M of
RAM. This causes an early panic:

BIOS-provided physical RAM map:
BIOS-88: 0000000000000000 - 000000000009f000 (usable)
BIOS-88: 0000000000100000 - 0000000000840000 (usable)
bootconsole [earlyser0] enabled
Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
DMI not present or invalid.
last_pfn = 0x840 max_arch_pfn = 0x100000
init_memory_mapping: 0000000000000000-0000000000840000
8MB LOWMEM available.
mapped low ram: 0 - 00840000
low ram: 0 - 00840000
Zone PFN ranges:
DMA 0x00000001 -> 0x00001000
Normal empty
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0: 0x00000001 -> 0x0000009f
0: 0x00000100 -> 0x00000840
BUG: Int 6: CR2 (null)
EDI c034663c ESI (null) EBP c0329f38 ESP c0329ef4
EBX c0346380 EDX 00000006 ECX ffffffff EAX fffffff4
err (null) EIP c0353191 CS c0320060 flg 00010082
Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
(null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
Pid: 0, comm: swapper Not tainted 2.6.36 #5
Call Trace:
[<c02e3707>] ? 0xc02e3707
[<c035e6e5>] 0xc035e6e5
[<c0353191>] ? 0xc0353191
[<c03534d6>] 0xc03534d6
[<c034f1cd>] 0xc034f1cd
[<c034a824>] 0xc034a824
[<c03513cb>] ? 0xc03513cb
[<c0349432>] 0xc0349432
[<c0349066>] 0xc0349066

It turns out that we should ignore the low limit of 16M.

Use alloc_bootmem_node_nopanic() in this case.

[akpm@linux-foundation.org: less mess]
Signed-off-by: Yinghai LU <yinghai@kernel.org>
Reported-by: Stefan Hellermann <stefan@the2masters.de>
Tested-by: Stefan Hellermann <stefan@the2masters.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org> [2.6.34+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f6ae448 14-Apr-2011 Paul Mundt <lethal@linux-sh.org>

mm/page_alloc.c: silence build_all_zonelists() section mismatch

The memory hotplug case involves calling to build_all_zonelists() which
in turns calls in to setup_zone_pageset(). The latter is marked
__meminit while build_all_zonelists() itself has no particular
annotation. build_all_zonelists() is only handed a non-NULL pointer in
the case of memory hotplug through an existing __meminit path, so the
setup_zone_pageset() reference is always safe.

The options as such are either to flag build_all_zonelists() as __ref (as
per __build_all_zonelists()), or to simply discard the __meminit
annotation from setup_zone_pageset().

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58c2ee40 14-Mar-2011 Nikanth Karthikesan <knikanth@suse.de>

mm: Fix section mismatch for setup_zone_pageset()

build_all_zonelists() which is not __meminit, calls setup_zone_pageset().

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# b2b755b5 24-Mar-2011 David Rientjes <rientjes@google.com>

lib, arch: add filter argument to show_mem and fix private implementations

Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from
meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
resulted in build warnings on all architectures that implement their own
versions of show_mem():

lib/lib.a(show_mem.o): In function `show_mem':
show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

The fix is to remove __show_mem() and add its argument to show_mem() in
all implementations to prevent this breakage.

Architectures that implement their own show_mem() actually don't do
anything with the argument yet, but they could be made to filter nodes
that aren't allowed in the current context in the future just like the
generic implementation.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f212ad7c 23-Mar-2011 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

memcg: add memcg sanity checks at allocating and freeing pages

Add checks at allocating or freeing a page whether the page is used (iow,
charged) from the view point of memcg.

This check may be useful in debugging a problem and we did similar checks
before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot).

This patch adds some overheads at allocating or freeing memory, so it's
enabled only when CONFIG_DEBUG_VM is enabled.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84be48d8 22-Mar-2011 Kirill A. Shutemov <kirill@shutemov.name>

mm/page_alloc.c: use list_move() instead of list_del()/list_add() combination

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78afd561 22-Mar-2011 Andi Kleen <ak@linux.intel.com>

mm: add __GFP_OTHER_NODE flag

Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread. This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.

This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.

I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function. Using the flag is a lot less
intrusive.

Open: should be also used for migration?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11bc82d6 22-Mar-2011 Andrea Arcangeli <aarcange@redhat.com>

mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those,
tends to be overkill in terms of latencies, so we can reduce their latency
by disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: Clemens Ladisch <cladisch@googlemail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d16871d 22-Mar-2011 Namhyung Kim <namhyung@gmail.com>

mm: batch-free pcp list if possible

free_pcppages_bulk() frees pages from pcp lists in a round-robin fashion
by keeping batch_free counter. But it doesn't need to spin if there is
only one non-empty list. This can be checked by batch_free ==
MIGRATE_PCPTYPES.

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cbf978bf 22-Mar-2011 David Rientjes <rientjes@google.com>

oom: suppress nodes that are not allowed from meminfo on page alloc failure

Displaying extremely verbose meminfo for all nodes on the system is
overkill for page allocation failures when the context restricts that
allocation to only a subset of nodes. We don't particularly care about
the state of all nodes when some are not allowed in the current context,
they can have an abundance of memory but we can't allocate from that part
of memory.

This patch suppresses disallowed nodes from the meminfo dump on a page
allocation failure if the context requires it.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29423e77 22-Mar-2011 David Rientjes <rientjes@google.com>

oom: suppress show_mem() for many nodes in irq context on page alloc failure

When a page allocation failure occurs, show_mem() is called to dump the
state of the VM so users may understand what happened to get into that
condition.

This output, however, can be extremely verbose. In irq context, it may
result in significant delays that incur NMI watchdog timeouts when the
machine is large (we use CONFIG_NODES_SHIFT > 8 here to define a "large"
machine since the length of the show_mem() output is proportional to the
number of possible nodes).

This patch suppresses the show_mem() call in irq context when the kernel
has CONFIG_NODES_SHIFT > 8.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ddd588b5 22-Mar-2011 David Rientjes <rientjes@google.com>

oom: suppress nodes that are not allowed from meminfo on oom kill

The oom killer is extremely verbose for machines with a large number of
cpus and/or nodes. This verbosity can often be harmful if it causes other
important messages to be scrolled from the kernel log and incurs a
signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
8.

This patch causes only memory information to be displayed for nodes that
are allowed by current's cpuset when dumping the VM state. Information
for all other nodes is irrelevant to the oom condition; we don't care if
there's an abundance of memory elsewhere if we can't access it.

This only affects the behavior of dumping memory information when an oom
is triggered. Other dumps, such as for sysrq+m, still display the
unfiltered form when using the existing show_mem() interface.

Additionally, the per-cpu pageset statistics are extremely verbose in oom
killer output, so it is now suppressed. This removes

nodes_weight(current->mems_allowed) * (1 + nr_cpus)

lines from the oom killer output.

Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
nodes.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef2b4b95 17-Mar-2011 Andrea Arcangeli <aarcange@redhat.com>

mm: PageBuddy and mapcount robustness

Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robusteness against page_mapcount() undeflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc289894 26-Feb-2011 Yinghai Lu <yinghai@kernel.org>

mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK

Heiko found recent memblock change triggers these warnings on s390:

mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used
mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used

Need to move those two function under HAVE_MEMBLOCK with its only
user, find_memory_core_early().

-tj: Minor updates to description.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 29723fcc 25-Feb-2011 Namhyung Kim <namhyung@gmail.com>

mm: fix dubious code in __count_immobile_pages()

When pfn_valid_within() failed 'iter' was incremented twice.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bc1f91e 24-Feb-2011 Yinghai Lu <yinghai@kernel.org>

bootmem: Move __alloc_memory_core_early() to nobootmem.c

Now that bootmem.c and nobootmem.c are separate, there's no reason to
define __alloc_memory_core_early(), which is used only by nobootmem,
inside #ifdef in page_alloc.c. Move it to nobootmem.c and make it
static.

This patch doesn't introduce any behavior change.

-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# e782ab42 24-Feb-2011 Yinghai Lu <yinghai@kernel.org>

bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c

Now that bootmem.c and nobootmem.c are separate, it's cleaner to
define contig_page_data in each file than in page_alloc.c with #ifdef.
Move it.

This patch doesn't introduce any behavior change.

-v2: According to Andrew, fixed the struct layout.
-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 2ff754fa 25-Jan-2011 David Rientjes <rientjes@google.com>

mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator

Commit 0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") uncovered a livelock in the page
allocator that resulted in tasks infinitely looping trying to find
memory and kswapd running at 100% cpu.

The issue occurs because drain_all_pages() is called immediately
following direct reclaim when no memory is freed and try_to_free_pages()
returns non-zero because all zones in the zonelist do not have their
all_unreclaimable flag set.

When draining the per-cpu pagesets back to the buddy allocator for each
zone, the zone->pages_scanned counter is cleared to avoid erroneously
setting zone->all_unreclaimable later. The problem is that no pages may
actually be drained and, thus, the unreclaimable logic never fails
direct reclaim so the oom killer may be invoked.

This apparently only manifested after wait_iff_congested() was
introduced and the zone was full of anonymous memory that would not
congest the backing store. The page allocator would infinitely loop if
there were no other tasks waiting to be scheduled and clear
zone->pages_scanned because of drain_all_pages() as the result of this
change before kswapd could scan enough pages to trigger the reclaim
logic. Additionally, with every loop of the page allocator and in the
reclaim path, kswapd would be kicked and would end up running at 100%
cpu. In this scenario, current and kswapd are all running continuously
with kswapd incrementing zone->pages_scanned and current clearing it.

The problem is even more pronounced when current swaps some of its
memory to swap cache and the reclaimable logic then considers all active
anonymous memory in the all_unreclaimable logic, which requires a much
higher zone->pages_scanned value for try_to_free_pages() to return zero
that is never attainable in this scenario.

Before wait_iff_congested(), the page allocator would incur an
unconditional timeout and allow kswapd to elevate zone->pages_scanned to
a level that the oom killer would be called the next time it loops.

The fix is to only attempt to drain pcp pages if there is actually a
quantity to be drained. The unconditional clearing of
zone->pages_scanned in free_pcppages_bulk() need not be changed since
other callers already ensure that draining will occur. This patch
ensures that free_pcppages_bulk() will actually free memory before
calling into it from drain_all_pages() so zone->pages_scanned is only
cleared if appropriate.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f33261d7 25-Jan-2011 David Rientjes <rientjes@google.com>

mm: fix deferred congestion timeout if preferred zone is not allowed

Before 0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone"), preferred_zone was only used for NUMA
statistics, to determine the zoneidx from which to allocate from given
the type requested, and whether to utilize memory compaction.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi. This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone
is not allowed in the current context, usually consuming 100% of a cpu.

This patch ensures preferred_zone is an allowed zone in the fastpath
depending on whether current is constrained by its cpuset or nodes in
its mempolicy (when the nodemask passed is non-NULL). This is correct
since the fastpath allocation always passes ALLOC_CPUSET when trying to
allocate memory. In the slowpath, this patch resets preferred_zone to
the first zone of the allowed type when the allocation is not
constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

This patch also ensures preferred_zone is from the set of allowed nodes
when called from within direct reclaim since allocations are always
constrained by cpusets in this context (it is blockable).

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c06b1fca 13-Jan-2011 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: don't cache `current' in a local

It's old-fashioned and unneeded.

akpm:/usr/src/25> size mm/page_alloc.o
text data bss dec hex filename
39884 1241317 18808 1300009 13d629 mm/page_alloc.o (before)
39838 1241317 18808 1299963 13d5fb mm/page_alloc.o (after)

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43506fad 13-Jan-2011 KyongHo Cho <pullip.cho@samsung.com>

mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists

The previous approach of calucation of combined index was

page_idx & ~(1 << order))

but we have same result with

page_idx & buddy_idx

This reduces instructions slightly as well as enhances readability.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix used-unintialised warning]
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f24ce5f 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: remove PG_buddy

PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c3240d9 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: don't alloc harder for gfp nomemalloc even if nowait

Not worth throwing away the precious reserved free memory pool for
allocations that can fail gracefully (either through mempool or because
they're transhuge allocations later falling back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32dba98e 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: _GFP_NO_KSWAPD

Transparent hugepage allocations must be allowed not to invoke kswapd or
any other kind of indirect reclaim (especially when the defrag sysfs is
control disabled). It's unacceptable to swap out anonymous pages
(potentially anonymous transparent hugepages) in order to create new
transparent hugepages. This is true for the MADV_HUGEPAGE areas too
(swapping out a kvm virtual machine and so having it suffer an unbearable
slowdown, so another one with guest physical memory marked MADV_HUGEPAGE
can run 30% faster if it is running memory intensive workloads, makes no
sense). If a transparent hugepage allocation fails the slowdown is minor
and there is total fallback, so kswapd should never be asked to swapout
memory to allow the high order allocation to succeed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 59ff4216 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: comment reminder in destroy_compound_page

Warn destroy_compound_page that __split_huge_page_refcount is heavily
dependent on its internal behavior.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8dd60a3a 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: clear compound mapping

Clear compound mapping for anonymous compound pages like it already
happens for regular anonymous pages. But crash if mapping is set for any
tail page, also the PageAnon check is meaningless for tail pages. This
check only makes sense for the head page, for tail page it can only hide
bugs and we definitely don't want to hide bugs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e9f64c4 13-Jan-2011 Andrea Arcangeli <aarcange@redhat.com>

thp: fix bad_page to show the real reason the page is bad

page_count shows the count of the head page, but the actual check is done
on the tail page, so show what is really being checked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ecb256f8 13-Jan-2011 Volodymyr G. Lukiianyk <volodymyrgl@gmail.com>

mm: set correct numa_zonelist_order string when configured on the kernel command line

When numa_zonelist_order parameter is set to "node" or "zone" on the
command line it's still showing as "default" in sysctl. That's because
early_param parsing function changes only user_zonelist_order variable.
Fix this by copying user-provided string to numa_zonelist_order if it was
successfully parsed.

Signed-off-by: Volodymyr G Lukiianyk <volodymyrgl@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99504748 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: kswapd: stop high-order balancing when any suitable zone is balanced

Simon Kirby reported the following problem

We're seeing cases on a number of servers where cache never fully
grows to use all available memory. Sometimes we see servers with 4 GB
of memory that never seem to have less than 1.5 GB free, even with a
constantly-active VM. In some cases, these servers also swap out while
this happens, even though they are constantly reading the working set
into memory. We have been seeing this happening for a long time; I
don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.

There are two apparent problems here. On the target machine, there is a
small Normal zone in comparison to DMA32. As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers. The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not. This keeps kswapd artifically awake.

A number of tests were run and the figures from previous postings will
look very different for a few reasons. One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem. In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases. Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints. There is less information
but recording is less disruptive.

The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.

POSTMARK
traceonly kanyzone
Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 16.58 17.4
Total Elapsed Time (seconds) 218.48 222.47

VMstat Reclaim Statistics: vmscan
Direct reclaims 0 4
Direct reclaim pages scanned 0 203
Direct reclaim pages reclaimed 0 184
Kswapd pages scanned 326631 322018
Kswapd pages reclaimed 312632 309784
Kswapd low wmark quickly 1 4
Kswapd high wmark quickly 122 475
Kswapd skip congestion_wait 1 0
Pages activated 700040 705317
Pages deactivated 212113 203922
Pages written 9875 6363

Total pages scanned 326631 322221
Total pages reclaimed 312632 309968
%age total pages scanned/reclaimed 95.71% 96.20%
%age total pages scanned/written 3.02% 1.97%

proc vmstat: Faults
Major Faults 300 254
Minor Faults 645183 660284
Page ins 493588 486704
Page outs 4960088 4986704
Swap ins 1230 661
Swap outs 9869 6355

Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed. Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.

The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive. As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.

The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.

MICRO
traceonly kanyzone
User/Sys Time Running Test (seconds) 29.29 28.87
Total Elapsed Time (seconds) 492.18 488.79

VMstat Reclaim Statistics: vmscan
Direct reclaims 2128 1460
Direct reclaim pages scanned 2284822 1496067
Direct reclaim pages reclaimed 148919 110937
Kswapd pages scanned 15450014 16202876
Kswapd pages reclaimed 8503697 8537897
Kswapd low wmark quickly 3100 3397
Kswapd high wmark quickly 1860 7243
Kswapd skip congestion_wait 708 801
Pages activated 9635 9573
Pages deactivated 1432 1271
Pages written 223 1130

Total pages scanned 17734836 17698943
Total pages reclaimed 8652616 8648834
%age total pages scanned/reclaimed 48.79% 48.87%
%age total pages scanned/written 0.00% 0.01%

proc vmstat: Faults
Major Faults 165 221
Minor Faults 9655785 9656506
Page ins 3880 7228
Page outs 37692940 37480076
Swap ins 0 69
Swap outs 19 15

Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster. Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.

I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures. Performance is not significantly changed and
the reclaim statistics look reasonable.

Tgis patch:

When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects. If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages. The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.

This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones. kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 77f1fe6b 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path

Migration synchronously waits for writeback if the initial passes fails.
Callers of memory compaction do not necessarily want this behaviour if the
caller is latency sensitive or expects that synchronous migration is not
going to have a significantly better success rate.

This patch adds a sync parameter to migrate_pages() allowing the caller to
indicate if wait_on_page_writeback() is allowed within migration or not.
For reclaim/compaction, try_to_compact_pages() is first called
asynchronously, direct reclaim runs and then try_to_compact_pages() is
called synchronously as there is a greater expectation that it'll succeed.

[akpm@linux-foundation.org: build/merge fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e7d3449 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim

Lumpy reclaim is disruptive. It reclaims a large number of pages and
ignores the age of the pages it reclaims. This can incur significant
stalls and potentially increase the number of major faults.

Compaction has reached the point where it is considered reasonably stable
(meaning it has passed a lot of testing) and is a potential candidate for
displacing lumpy reclaim. This patch introduces an alternative to lumpy
reclaim whe compaction is available called reclaim/compaction. The basic
operation is very simple - instead of selecting a contiguous range of
pages to reclaim, a number of order-0 pages are reclaimed and then
compaction is later by either kswapd (compact_zone_order()) or direct
compaction (__alloc_pages_direct_compact()).

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: use conventional task_struct naming]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88f5acf8 13-Jan-2011 Mel Gorman <mel@csn.ul.ie>

mm: page allocator: adjust the per-cpu counter threshold when memory is low

Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold. On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high. The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues. The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level. This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node. There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat(). We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced. The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla 11.6615%
disable-threshold 0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity. I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org> [2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a4a678b 17-Dec-2010 Yinghai Lu <yinghai@kernel.org>

memblock: Make find_memory_core_early() find from top-down

That is used for find ram in node or bootmem type.

We should make it top-down so it will be consistent to memblock_find,
and to avoid allocating potentially valuable low memory before we
actually need it.

Suggested-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4D0C075B.3040501@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# c9e664f1 03-Dec-2010 Rafael J. Wysocki <rjw@rjwysocki.net>

PM / Hibernate: Fix memory corruption related to swap

There is a problem that swap pages allocated before the creation of
a hibernation image can be released and used for storing the contents
of different memory pages while the image is being saved. Since the
kernel stored in the image doesn't know of that, it causes memory
corruption to occur after resume from hibernation, especially on
systems with relatively small RAM that need to swap often.

This issue can be addressed by keeping the GFP_IOFS bits clear
in gfp_allowed_mask during the entire hibernation, including the
saving of the image, until the system is finally turned off or
the hibernation is aborted. Unfortunately, for this purpose
it's necessary to rework the way in which the hibernate and
suspend code manipulates gfp_allowed_mask.

This change is based on an earlier patch from Hugh Dickins.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org


# fa9f90be 28-Nov-2010 Jesper Juhl <jj@chaosbits.net>

Kill off a bunch of warning: ‘inline’ is not at beginning of declaration

These warnings are spewed during a build of a 'allnoconfig' kernel
(especially the ones from u64_stats_sync.h show up a lot) when building
with -Wextra (which I often do)..
They are
a) annoying
b) easy to get rid of.
This patch kills them off.

include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# e9959f0f 24-Nov-2010 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run()

During memory hotplug, build_allzonelists() may be called under
stop_machine_run(). In this function, setup_zone_pageset() is called.
But it's bug because it will do page allocation under stop_machine_run().

Here is a report from Alok Kataria.

BUG: sleeping function called from invalid context at kernel/mutex.c:94
in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
Call Trace:
[<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
[<ffffffff81468245>] mutex_lock+0x24/0x50
[<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
[<ffffffff81048888>] ? load_balance+0xbe/0x60e
[<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
[<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
[<ffffffff8110f237>] __alloc_percpu+0x10/0x12
[<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
[<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
[<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
[<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
[<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
[<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
[<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
[<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
[<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
[<ffffffff81065f29>] kthread+0x7f/0x87
[<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
[<ffffffff81065eaa>] ? kthread+0x0/0x87
[<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
Policy zone: Normal

This patch tries to fix the issue by moving setup_zone_pageset() out from
stop_machine_run(). It's obviously not necessary to be called under
stop_machine_run().

[akpm@linux-foundation.org: remove unneeded local]
Reported-by: Alok Kataria <akataria@vmware.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vmware.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e6223a3b 26-Oct-2010 Namhyung Kim <namhyung@gmail.com>

mm: add casts to/from gfp_t in gfp_to_alloc_flags()

This removes following warning from sparse:

mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e093d99 26-Oct-2010 Mel Gorman <mel@csn.ul.ie>

writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone

If congestion_wait() is called with no BDI congested, the caller will
sleep for the full timeout and this may be an unnecessary sleep. This
patch adds a wait_iff_congested() that checks congestion and only sleeps
if a BDI is congested else, it calls cond_resched() to ensure the caller
is not hogging the CPU longer than its quota but otherwise will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but
after this patch, it'll just call schedule() if it has been on the CPU too
long. Similar logic applies to direct reclaimers that are not making
enough progress.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49ac8255 26-Oct-2010 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory hotplug: unify is_removable and offline detection code

Now, sysfs interface of memory hotplug shows whether the section is
removable or not. But it checks only migrateype of pages and doesn't
check details of cluster of pages.

Next, memory hotplug's set_migratetype_isolate() has the same kind of
check, too.

This patch adds the function __count_unmovable_pages() and makes above 2
checks to use the same logic. Then, is_removable and hotremove code uses
the same logic. No changes in the hotremove logic itself.

TODO: need to find a way to check RECLAMABLE. But, considering bit,
calling shrink_slab() against a range before starting memory hotremove
sounds better. If so, this patch's logic doesn't need to be changed.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b20477f 26-Oct-2010 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory hotplug: fix notifier's return value check

Even if notifier cannot find any pages, it doesn't mean no pages are
available...And, if there are no notifiers registered, this condition will
be always true and memory hotplug will show -EBUSY.

This is a bug but not critical.

In most case, a pageblock which will be offlined is MIGRATE_MOVABLE This
"notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE.
But if not MIGRATE_MOVABLE, it's common case that memory hotplug will
fail. So, no one notice this bug.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b7f50cfa 26-Oct-2010 Mel Gorman <mel@csn.ul.ie>

mm, page-allocator: do not check the state of a non-existant buddy during free

There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
in buddy allocator by adding buddies that are merging to the tail of the
free lists") that means a buddy at order MAX_ORDER is checked for merging.
A page of this order never exists so at times, an effectively random
piece of memory is being checked.

Alan Curry has reported that this is causing memory corruption in
userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
It is not clear why this is happening. It could be a cache coherency
problem where pages mapped in both user and kernel space are getting
different cache lines due to the bad read from kernel space
(http://lkml.org/lkml/2010/10/13/179). It could also be that there are
some special registers being io-remapped at the end of the memmap array
and that a read has special meaning on them. Compiler bugs have been
ruled out because the assembly before and after the patch looks relatively
harmless.

This patch fixes the problem by ensuring we are not reading a possibly
invalid location of memory. It's not clear why the read causes corruption
but one way or the other it is a buggy read.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Reported-by: Alan Curry <pacman@kosh.dhis.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f241e660 07-Oct-2010 Robin Holt <holt@sgi.com>

mm: alloc_large_system_hash() printk overflow on 16TB boot

During boot of a 16TB system, the following is printed:
Dentry cache hash table entries: -2147483648 (order: 22, 17179869184 bytes)

Signed-off-by: Robin Holt <holt@sgi.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ee493ce 09-Sep-2010 Mel Gorman <mel@csn.ul.ie>

mm: page allocator: drain per-cpu lists after direct reclaim allocation fails

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the
problem.

This patch notes that if direct reclaim is making progress but allocations
are still failing that the system is already under heavy pressure. In
this case, it drains the per-cpu lists and tries the allocation a second
time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa454840 09-Sep-2010 Christoph Lameter <cl@linux.com>

mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold. On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero. Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72853e29 09-Sep-2010 Mel Gorman <mel@csn.ul.ie>

mm: page allocator: update free page counters after pages are placed on the free list

When allocating a page, the system uses NR_FREE_PAGES counters to
determine if watermarks would remain intact after the allocation was made.
This check is made without interrupts disabled or the zone lock held and
so is race-prone by nature. Unfortunately, when pages are being freed in
batch, the counters are updated before the pages are added on the list.
During this window, the counters are misleading as the pages do not exist
yet. When under significant pressure on systems with large numbers of
CPUs, it's possible for processes to make progress even though they should
have been stalled. This is particularly problematic if a number of the
processes are using GFP_ATOMIC as the min watermark can be accidentally
breached and in extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

[akpm@linux-foundation.org: avoid modifying incoming args]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72d7c3b3 25-Aug-2010 Yinghai Lu <yinghai@kernel.org>

x86: Use memblock to replace early_res

1. replace find_e820_area with memblock_find_in_range
2. replace reserve_early with memblock_x86_reserve_range
3. replace free_early with memblock_x86_free_range.
4. NO_BOOTMEM will switch to use memblock too.
5. use _e820, _early wrap in the patch, in following patch, will
replace them all
6. because memblock_x86_free_range support partial free, we can remove some special care
7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
so adjust some calling later in setup.c::setup_arch()
-- corruption_check and mptable_update

-v2: Move reserve_brk() early
Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
that could happen We have more then 128 RAM entry in E820 tables, and
memblock_x86_fill() could use memblock_find_in_range() to find a new place for
memblock.memory.region array.
and We don't need to use extend_brk() after fill_memblock_area()
So move reserve_brk() early before fill_memblock_area().
-v3: Move find_smp_config early
To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
in right place.
-v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
memblock.reserved already..
use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
-v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
active_region for 32bit does include high pages
need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
-v6: Use current_limit instead
-v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
-v8: Set memblock_can_resize early to handle EFI with more RAM entries
-v9: update after kmemleak changes in mainline

Suggested-by: David S. Miller <davem@davemloft.net>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# edbe7d23 25-Aug-2010 Yinghai Lu <yinghai@kernel.org>

memblock: Add find_memory_core_early()

According to node range in early_node_map[] with __memblock_find_in_range
to find free range.

Will be used by memblock_x86_find_in_range_node()

memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 25edde03 09-Aug-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

vmscan: kill prev_priority completely

Since 2.6.28 zone->prev_priority is unused. Then it can be removed
safely. It reduce stack usage slightly.

Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
can be integrate again, it's useful. but four (or more) times trying
haven't got good performance number. Thus I give up such approach.

The rest of this changelog is notes on prev_priority and why it existed in
the first place and why it might be not necessary any more. This information
is based heavily on discussions between Andrew Morton, Rik van Riel and
Kosaki Motohiro who is heavily quotes from.

Historically prev_priority was important because it determined when the VM
would start unmapping PTE pages. i.e. there are no balances of note within
the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
is a potential risk of unnecessarily increasing minor faults as a large
amount of read activity of use-once pages could push mapped pages to the
end of the LRU and get unmapped.

There is no proof this is still a problem but currently it is not considered
to be. Active files are not deactivated if the active file list is smaller
than the inactive list reducing the liklihood that file-mapped pages are
being pushed off the LRU and referenced executable pages are kept on the
active list to avoid them getting pushed out by read activity.

Even if it is a problem, prev_priority prev_priority wouldn't works
nowadays. First of all, current vmscan still a lot of UP centric code. it
expose some weakness on some dozens CPUs machine. I think we need more and
more improvement.

The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
and per-task-pressure a bit. example, prev_priority try to boost priority to
other concurrent priority. but if the another task have mempolicy restriction,
it is unnecessary, but also makes wrong big latency and exceeding reclaim.
per-task based priority + prev_priority adjustment make the emulation of
per-system pressure. but it have two issue 1) too rough and brutal emulation
2) we need per-zone pressure, not per-system.

Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
prev_priority can't solve such multithreads workload issue. In other word,
prev_priority concept assume the sysmtem don't have lots threads."

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff321fea 09-Aug-2010 Minchan Kim <minchan.kim@gmail.com>

mm: rename try_set_zone_oom() to try_set_zonelist_oom()

We have been used naming try_set_zone_oom and clear_zonelist_oom.
The role of functions is to lock of zonelist for preventing parallel
OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather
awkward and unmatched with clear_zonelist_oom.

Let's change it with try_set_zonelist_oom.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03668b3c 09-Aug-2010 David Rientjes <rientjes@google.com>

oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help. The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O. Killing such an
application may leave the hardware in an unspecified state and there is no
guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8ab9f82 20-Jul-2010 Yinghai Lu <yinghai@kernel.org>

x86,nobootmem: make alloc_bootmem_node fall back to other node when 32bit numa is used

Borislav Petkov reported his 32bit numa system has problem:

[ 0.000000] Reserving total of 4c00 pages for numa KVA remap
[ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
[ 0.000000] max_pfn = 238000
[ 0.000000] 8202MB HIGHMEM available.
[ 0.000000] 885MB LOWMEM available.
[ 0.000000] mapped low ram: 0 - 375fe000
[ 0.000000] low ram: 0 - 375fe000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
[ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
[ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
[ 0.000000] BUG: unable to handle kernel paging request at 40000000
[ 0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
[ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
...
[ 0.000000] Call Trace:
[ 0.000000] [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
[ 0.000000] [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
[ 0.000000] [<c2c9149e>] ? sparse_init+0x1dc/0x499
[ 0.000000] [<c2c79118>] ? paging_init+0x168/0x1df
[ 0.000000] [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb

looks like it allocates too much high address for bootmem.

Try to cut limit with get_max_mapped()

Reported-by: Borislav Petkov <borislav.petkov@amd.com>
Tested-by: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@kernel.org> [2.6.34.x]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9078370c 19-Jul-2010 Catalin Marinas <catalin.marinas@arm.com>

kmemleak: Add support for NO_BOOTMEM configurations

With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
friends use the early_res functions for memory management when
NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
corresponding code paths for bootmem allocations.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org


# 7aac7898 26-May-2010 Lee Schermerhorn <lee.schermerhorn@hp.com>

numa: introduce numa_mem_id()- effective local memory node id

Introduce numa_mem_id(), based on generic percpu variable infrastructure
to track "nearest node with memory" for archs that support memoryless
nodes.

Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES
if/when they support them.

Archs can override definitions of:

numa_mem_id() - returns node number of "local memory" node
set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on change of
numa_zonelist_order, or when node or memory hot-plug requires zonelist
rebuild. Archs that support memoryless nodes will need to initialize
'numa_mem' for secondary cpus as they're brought on-line.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72812019 26-May-2010 Lee Schermerhorn <lee.schermerhorn@hp.com>

numa: add generic percpu var numa_node_id() implementation

Rework the generic version of the numa_node_id() function to use the new
generic percpu variable infrastructure.

Guard the new implementation with a new config option:

CONFIG_USE_PERCPU_NUMA_NODE_ID.

Archs which support this new implemention will default this option to 'y'
when NUMA is configured. This config option could be removed if/when all
archs switch over to the generic percpu implementation of numa_node_id().
Arch support involves:

1) converting any existing per cpu variable implementations to use
this implementation. x86_64 is an instance of such an arch.
2) archs that don't use a per cpu variable for numa_node_id() will
need to initialize the new per cpu variable "numa_node" as cpus
are brought on-line. ia64 is an example.
3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
when NUMA is configured. This is required because I have
retained the old implementation by default to allow archs to
be modified incrementally, as desired.

Subsequent patches will convert x86_64 and ia64 to use this implemenation.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4eaf3f64 24-May-2010 Haicheng Li <haicheng.li@linux.intel.com>

mem-hotplug: fix potential race while building zonelist for new populated zone

Add global mutex zonelists_mutex to fix the possible race:

CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();

In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).

Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).

[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f522509 24-May-2010 Haicheng Li <haicheng.li@linux.intel.com>

mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset

For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:

1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().

2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()

Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:

------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
[<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
[<ffffffff81128407>] __page_cache_alloc+0x67/0x70
[<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
[<ffffffff81132751>] ra_submit+0x21/0x30
[<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
[<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
[<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
[<ffffffff81266cfa>] nfs_file_read+0xca/0x130
[<ffffffff8117b20a>] do_sync_read+0xfa/0x140
[<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
[<ffffffff8117c151>] sys_read+0x51/0x80
[<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
RSP <ffff88000d1e78a8>
---[ end trace 4bda28328b9990db ]

[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 319774e2 24-May-2010 Wu Fengguang <fengguang.wu@intel.com>

mem-hotplug: separate setup_per_cpu_pageset() into separate functions

No behavior change here.

Move some of setup_per_cpu_pageset() code into a new function
setup_zone_pageset() that will be useful for memory hotplug.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec95f53a 24-May-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: introduce free_pages_prepare()

free_hot_cold_page() and __free_pages_ok() have very similar freeing
preparation. Consolidate them.

[akpm@linux-foundation.org: fix busted coding style]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f92e258 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: compaction: defer compaction using an exponential backoff when compaction fails

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons as to why

o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6. If compaction succeeds, the defer
counters are reset again.

The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone. To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache. This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56de7263 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: compaction: direct compact when a high-order allocation fails

Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation. With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed. Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap. If compaction fails to free a page of a suitable size, then
reclaim will still occur.

Direct compaction returns as soon as possible. As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.

[akpm@linux-foundation.org: Fix build errors]
[aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 748446bb 24-May-2010 Mel Gorman <mel@csn.ul.ie>

mm: compaction: memory compaction core

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.

[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e325c90f 24-May-2010 David Rientjes <rientjes@google.com>

mm: default to node zonelist ordering when nodes have only lowmem

There are two types of zonelist ordering methodologies:

- node order, preferring allocations on a node to stay local to and

- zone order, preferring allocations come from a higher zone to avoid
allocating in lowmem zones even though they may not be local.

The ordering technique used by the kernel is configurable on the command
line, but also has some logic to determine what the default should be.

This logic currently lacks knowledge of systems where a node may only have
lowmem. For such systems, it is necessary to use node order so that
GFP_KERNEL allocations may be satisfied by nodes consisting of only
lowmem.

If zone order is used, GFP_KERNEL allocations to such nodes are actually
allocated on a node with local affinity that includes ZONE_NORMAL.

This change defaults to node zonelist ordering if any node lacks
ZONE_NORMAL.

To force zone order, append 'numa_zonelist_order=zone' to the kernel
command line.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0ff7453 24-May-2010 Miao Xie <miaox@cn.fujitsu.com>

cpuset,mm: fix no node to alloc memory when changing cpuset's mems

Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But in the way, the allocator may find that
there is no node to alloc memory.

The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:

(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oom

This patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6dda9d55 24-May-2010 Corrado Zoccolo <czoccolo@gmail.com>

page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists

In order to reduce fragmentation, this patch classifies freed pages in two
groups according to their probability of being part of a high order merge.
Pages belonging to a compound whose next-highest buddy is free are more
likely to be part of a high order merge in the near future, so they will
be added at the tail of the freelist. The remaining pages are put at the
front of the freelist.

In this way, the pages that are more likely to cause a big merge are kept
free longer. Consequently there is a tendency to aggregate the
long-living allocations on a subset of the compounds, reducing the
fragmentation.

This heuristic was tested on three machines, x86, x86-64 and ppc64 with
3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
and STREAM for performance and a high-order stress test for huge page
allocations.

KernBench X86
Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
User mean 649.53 ( 0.00%) 650.44 (-0.14%)
System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)

KernBench X86-64
Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)

KernBench PPC64
Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)

Nothing notable for kernbench.

NetPerf UDP X86
64 42.68 ( 0.00%) 42.77 ( 0.21%)
128 85.62 ( 0.00%) 85.32 (-0.35%)
256 170.01 ( 0.00%) 168.76 (-0.74%)
1024 655.68 ( 0.00%) 652.33 (-0.51%)
2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*

NetPerf UDP X86-64
64 148.82 ( 0.00%) 154.92 ( 3.94%)
128 298.96 ( 0.00%) 312.95 ( 4.47%)
256 583.67 ( 0.00%) 626.39 ( 6.82%)
1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
1.64% 2.73%

NetPerf UDP PPC64
64 49.98 ( 0.00%) 50.25 ( 0.54%)
128 98.66 ( 0.00%) 100.95 ( 2.27%)
256 197.33 ( 0.00%) 191.03 (-3.30%)
1024 761.98 ( 0.00%) 785.07 ( 2.94%)
2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)

The tests are run to have confidence limits within 1%. Results marked
with a * were not confident although in this case, it's only outside by
small amounts. Even with some results that were not confident, the
netperf UDP results were generally positive.

NetPerf TCP X86
64 652.25 ( 0.00%)* 648.12 (-0.64%)*
23.80% 22.82%
128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
21.03% 18.90%
256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
1.00% 16.46%
1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
13.37% 11.39%
2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
9.76% 12.48%
3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
6.49% 8.75%
4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
9.85% 8.50%
8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
9.13% 13.04%
16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
7.73% 8.68%

NETPERF TCP X86-64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
5.79% 4.78%
128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
3.68% 6.06%
256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
3.02% 2.10%
1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
5.89% 6.55%
2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
7.08% 7.44%
3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
6.87% 7.33%
4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
6.86% 8.18%
8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
7.49% 5.55%
16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
7.36% 6.49%

NETPERF TCP PPC64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 594.17 ( 0.00%) 596.04 ( 0.31%)*
1.00% 2.29%
128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
1.30% 1.40%
256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
1.25% 1.00%
1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
1.02% 1.00%
2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
1.15% 1.04%
3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
16384 9618.65 ( 0.00%) 9490.10 (-1.35%)

There was a distinct lack of confidence in the X86* figures so I included
what the devation was where the results were not confident. Many of the
results, whether gains or losses were within the standard deviation so no
solid conclusion can be reached on performance impact. Looking at the
figures, only the X86-64 ones look suspicious with a few losses that were
outside the noise. However, the results were so unstable that without
knowing why they vary so much, a solid conclusion cannot be reached.

SYSBENCH X86
sysbench-vanilla pgalloc-delay
1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
2 14901.11 ( 0.00%) 13683.44 (-8.90%)
3 15171.71 ( 0.00%) 14888.25 (-1.90%)
4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
6 14870.33 ( 0.00%) 14845.57 (-0.17%)
7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
8 14354.35 ( 0.00%) 14362.31 ( 0.06%)

SYSBENCH X86-64
1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
2 34276.39 ( 0.00%) 34251.00 (-0.07%)
3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
4 66667.10 ( 0.00%) 66174.69 (-0.74%)
5 66003.91 ( 0.00%) 65685.25 (-0.49%)
6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
7 64933.16 ( 0.00%) 64379.23 (-0.86%)
8 63353.30 ( 0.00%) 63281.22 (-0.11%)
9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
11 62092.81 ( 0.00%) 61787.75 (-0.49%)
12 61330.11 ( 0.00%) 61036.34 (-0.48%)
13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
14 62304.48 ( 0.00%) 62064.90 (-0.39%)
15 63296.48 ( 0.00%) 62875.16 (-0.67%)
16 63951.76 ( 0.00%) 63769.09 (-0.29%)

SYSBENCH PPC64
-sysbench-pgalloc-delay-sysbench
sysbench-vanilla pgalloc-delay
1 7645.08 ( 0.00%) 7467.43 (-2.38%)
2 14856.67 ( 0.00%) 14558.73 (-2.05%)
3 21952.31 ( 0.00%) 21683.64 (-1.24%)
4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
6 27477.10 ( 0.00%) 27337.45 (-0.51%)
7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
8 26642.91 ( 0.00%) 25274.33 (-5.41%)
9 25137.27 ( 0.00%) 24810.06 (-1.32%)
10 24451.99 ( 0.00%) 24275.85 (-0.73%)
11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
12 24234.81 ( 0.00%) 23640.89 (-2.51%)
13 24577.75 ( 0.00%) 24433.50 (-0.59%)
14 25640.19 ( 0.00%) 25116.52 (-2.08%)
15 26188.84 ( 0.00%) 26181.36 (-0.03%)
16 26782.37 ( 0.00%) 26255.99 (-2.00%)

Again, there is little to conclude here. While there are a few losses,
the results vary by +/- 8% in some cases. They are the results of most
concern as there are some large losses but it's also within the variance
typically seen between kernel releases.

The STREAM results varied so little and are so verbose that I didn't
include them here.

The final test stressed how many huge pages can be allocated. The
absolute number of huge pages allocated are the same with or without the
page. However, the "unusability free space index" which is a measure of
external fragmentation was slightly lower (lower is better) throughout the
lifetime of the system. I also measured the latency of how long it took
to successfully allocate a huge page. The latency was slightly lower and
on X86 and PPC64, more huge pages were allocated almost immediately from
the free lists. The improvement is slight but there.

[mel@csn.ul.ie: Tested, reworked for less branches]
[czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88393161 16-Mar-2010 Thomas Weber <swirl@gmx.li>

Fix typos in comments

[Ss]ytem => [Ss]ystem
udpate => update
paramters => parameters
orginal => original

Signed-off-by: Thomas Weber <swirl@gmx.li>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 718a3821 10-Mar-2010 Wu Fengguang <fengguang.wu@intel.com>

mm: introduce dump_page() and print symbolic flag names

- introduce dump_page() to print the page info for debugging some error
condition.

- convert three mm users: bad_page(), print_bad_pte() and memory offline
failure.

- print an extra field: the symbolic names of page->flags

Example dump_page() output:

[ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
[ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alex Chiang <achiang@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d30a1f6 10-Mar-2010 Thomas Gleixner <tglx@linutronix.de>

mm: do not iterate over NR_CPUS in __zone_pcp_update()

__zone_pcp_update() iterates over NR_CPUS instead of limiting the access
to the possible cpus. This might result in access to uninitialized areas
as the per cpu allocator only populates the per cpu memory for possible
cpus.

This problem was created as a result of the dynamic allocation of pagesets
from percpu memory that went in during the merge window - commit
99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
conversion").

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72f0ba02 05-Mar-2010 David Rientjes <rientjes@google.com>

mm: suppress pfn range output for zones without pages

free_area_init_nodes() emits pfn ranges for all zones on the system.
There may be no pages on a higher zone, however, due to memory limitations
or the use of the mem= kernel parameter. For example:

Zone PFN ranges:
DMA 0x00000001 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal 0x00100000 -> 0x00100000

The implementation copies the previous zone's highest pfn, if any, as the
next zone's lowest pfn. If its highest pfn is then greater than the
amount of addressable memory, the upper memory limit is used instead.
Thus, both the lowest and highest possible pfn for higher zones without
memory may be the same.

The pfn range for zones without memory is now shown as "empty" instead.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 452aa699 05-Mar-2010 Rafael J. Wysocki <rjw@rjwysocki.net>

mm/pm: force GFP_NOIO during suspend/hibernation and resume

There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang, because
the I/O operations they depend on cannot be completed due to the
underlying devices being suspended.

Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the original
values of these bits in gfp_allowed_mask durig the subsequent resume.

[akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93e4a89a 05-Mar-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: restore zone->all_unreclaimable to independence word

commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.

Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc91668e 05-Mar-2010 Li Hong <lihong.hi@gmail.com>

mm: remove free_hot_page()

free_hot_page() is just a wrapper around free_hot_cold_page() with
parameter 'cold = 0'. After adding a clear comment for
free_hot_cold_page(), it is reasonable to remove a level of call.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c475dab6 05-Mar-2010 Li Hong <lihong.hi@gmail.com>

mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct

Move a call of trace_mm_page_free_direct() from free_hot_page() to
free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
as it is done in function __free_pages_ok().

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f650316c 05-Mar-2010 Li Hong <lihong.hi@gmail.com>

mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct

trace_mm_page_free_direct() is called in function __free_pages(). But it
is called again in free_hot_page() if order == 0 and produce duplicate
records in trace file for mm_page_free_direct event. As below:

K-PID CPU# TIMESTAMP FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

This patch removes the first call and adds a call to
trace_mm_page_free_direct() in __free_pages_ok().

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2ee78f7b 17-Feb-2010 Yinghai Lu <yinghai@kernel.org>

x86: Fix non-bootmem compilation on PowerPC

These build errors on some non-x86 platforms (PowerPC for example):

mm/page_alloc.c: In function '__alloc_memory_core_early':
mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'

The function is only needed on CONFIG_NO_BOOTMEM.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <4B747239.4070907@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 08677214 10-Feb-2010 Yinghai Lu <yinghai@kernel.org>

x86: Make 64 bit use early_res instead of bootmem before slab

Finally we can use early_res to replace bootmem for x86_64 now.

Still can use CONFIG_NO_BOOTMEM to enable it or not.

-v2: fix 32bit compiling about MAX_DMA32_PFN
-v3: folded bug fix from LKML message below

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B747239.4070907@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# a7016235 29-Jan-2010 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: fix migratetype bug which slowed swapping

After memory pressure has forced it to dip into the reserves, 2.6.32's
5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

Fix that in the most straightforward way (which, considering the overheads
of alternative approaches, is Mel's preference): the right migratetype is
already in page_private(page), but free_pcppages_bulk() wasn't using it.

How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
commit was easy, but explaining the magnitude of the slowdown not easy.

The same effect appears, but much less markedly, with SLAB, and even
less markedly on other machines (the PowerMac divides into fewer zones
than x86, I think that may be a factor). We guess that lumpy reclaim
of short-lived high-order pages is implicated in some way, and probably
this bug has been tickling a poor decision somewhere in page reclaim.

But instrumentation hasn't told me much, I've run out of time and
imagination to determine exactly what's going on, and shouldn't hold up
the fix any longer: it's valid, and might even fix other misbehaviours.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6ccf80eb 15-Jan-2010 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

page allocator: update NR_FREE_PAGES only when necessary

commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
made one minor regression. if __rmqueue() was failed, NR_FREE_PAGES stat
go wrong. this patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Huang Shijie <shijie8@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d2dbe08d 15-Jan-2010 Kazuhisa Ichikawa <ki@epsilou.com>

mm/page_alloc: fix the range check for backward merging

The current check for 'backward merging' within add_active_range() does
not seem correct. start_pfn must be compared against
early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
the new region is backward-mergeable with the existing range.

Signed-off-by: Kazuhisa Ichikawa <ki@epsilou.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99dcc3e5 04-Jan-2010 Christoph Lameter <cl@linux-foundation.org>

this_cpu: Page allocator conversion

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

V7-V8:
- Explain chicken egg dilemmna with percpu allocator.

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

tj: Build failure in pageset_cpuup_callback() due to missing ret
variable fixed.

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 443c6f14 23-Dec-2009 Andi Kleen <ak@linux.intel.com>

SYSCTL: Add a mutex to the page_alloc zone order sysctl

The zone list code clearly cannot tolerate concurrent writers (I couldn't
find any locks for that), so simply add a global mutex. No need for RCU
in this case.

Signed-off-by: Andi Kleen <ak@linux.intel.com>


# 925cc71e 17-Dec-2009 Robert Jennings <rcj@linux.vnet.ibm.com>

mm: Add notifier in pageblock isolation for balloon drivers

Memory balloon drivers can allocate a large amount of memory which is not
movable but could be freed to accomodate memory hotplug remove.

Prior to calling the memory hotplug notifier chain the memory in the
pageblock is isolated. Currently, if the migrate type is not
MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
for that page range to fail.

Rather than failing pageblock isolation if the migrateteype is not
MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
and not on the LRU, are owned by a registered balloon driver (or other
entity) using a notifier chain. If all of the non-movable pages are owned
by a balloon, they can be freed later through the memory notifier chain
and the range can still be isolated in set_migratetype_isolate().

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Gerald Schaefer <geralds@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 32996250 15-Dec-2009 Yinghai Lu <yinghai@kernel.org>

x86: Fix checking of SRAT when node 0 ram is not from 0

Found one system that boot from socket1 instead of socket0, SRAT get rejected...

[ 0.000000] SRAT: Node 1 PXM 0 0-a0000
[ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
[ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
[ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
[ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
[ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
[ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
[ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
[ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
[ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
...
[ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
[ 0.000000] NUMA: Using 20 for the hash shift.
[ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
[ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
[ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
[ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
[ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
[ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
[ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
[ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
[ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
[ 0.000000] SRAT: SRAT not used.

the early_node_map is not sorted because node0 with non zero start come first.

so try to sort it right away after all regions are registered.

also fixs refression by 8716273c (x86: Export srat physical topology)

-v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
-v3: update comments.

Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B2579D2.3010201@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 4365a567 15-Dec-2009 KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

oom-kill: fix NUMA constraint check with nodemask

Fix node-oriented allocation handling in oom-kill.c I myself think of this
as a bugfix not as an ehnancement.

In these days, things are changed as
- alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
- mempolicy don't maintain its own private zonelists.
(And cpuset doesn't use nodemask for __alloc_pages_nodemask())

So, current oom-killer's check function is wrong.

This patch does
- check nodemask, if nodemask && nodemask doesn't cover all
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- Scan all zonelist under nodemask, if it hits cpuset's wall
this faiulre is from cpuset.
And
- modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
This doesn't change "current" behavior. If callers use __GFP_THISNODE
it should handle "page allocation failure" by itself.

- handle __GFP_NOFAIL+__GFP_THISNODE path.
This is something like a FIXME but this gfpmask is not used now.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d22ba1b 15-Dec-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: detect free buddy pages explicitly

Most free pages in the buddy system have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.

CC: Nick Piggin <npiggin@suse.de>
CC: Mel Gorman <mel@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# af8e3354 14-Dec-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: CONFIG_MMU for PG_mlocked

Remove three degrees of obfuscation, left over from when we had
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
built when CONFIG_MMU, so don't need such conditions at all.

Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
169 defconfigs: leave those to evolve in due course.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d795fb1 26-Nov-2009 Ingo Molnar <mingo@elte.hu>

tracing: Fix kmem event exports

Commit 53d0422 ("tracing: Convert some kmem events to DEFINE_EVENT")
moved the kmem tracepoint creation from util.c to page_alloc.c,
but forgot to move the exports.

Move them back.

Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 53d0422c 26-Nov-2009 Li Zefan <lizf@cn.fujitsu.com>

tracing: Convert some kmem events to DEFINE_EVENT

Use DECLARE_EVENT_CLASS to remove duplicate code:

text data bss dec hex filename
333987 69800 27228 431015 693a7 mm/built-in.o.old
330030 69800 27228 427058 68432 mm/built-in.o

8 events are converted:

kmem_alloc: kmalloc, kmem_cache_alloc
kmem_alloc_node: kmalloc_node, kmem_cache_alloc_node
kmem_free: kfree, kmem_cache_free
mm_page: mm_page_alloc_zone_locked, mm_page_pcpu_drain

No change in functionality.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 9d0ed60f 11-Nov-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: Do not allow interrupts to use ALLOC_HARDER

Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
calculate the alloc_flags for allocation only once") altered watermark
logic slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.

This change results in a reduction of the number high-order GFP_ATOMIC
allocation failures reported. See
http://www.gossamer-threads.com/lists/linux/kernel/1144153

[rientjes@google.com: Spotted the problem]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc4a6851 11-Nov-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: always wake kswapd when restarting an allocation attempt after direct reclaim failed

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as
well but currently, kswapd is not woken up after direct reclaim fails.
For order-0 allocations, this makes little difference but if there is a
heavy mix of higher-order allocations that direct reclaim is failing for,
it might mean that kswapd is not rewoken for higher orders as much as it
did previously.

This patch wakes up kswapd when an allocation is being retried after a
direct reclaim failure. It would be expected that kswapd is already
awake, but this has the effect of telling kswapd to reclaim at the higher
order as well.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b76146ed 26-Oct-2009 Andrew Morton <akpm@linux-foundation.org>

revert "mm: oom analysis: add buffer cache information to show_free_areas()"

Revert

commit 71de1ccbe1fb40203edd3beb473f8580d917d2ca
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
AuthorDate: Mon Sep 21 17:01:31 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Tue Sep 22 07:17:27 2009 -0700

mm: oom analysis: add buffer cache information to show_free_areas()

show_free_areas() is called during page allocation failures, and page
allocation failures can occur in any calling context.

But nr_blockdev_pages() takes VFS locks which should not be taken from
hard IRQ context (at least). The result is lockdep warnings (and
deadlockability) during page allocation failures.

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d65af78 23-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03f6462a 21-Sep-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: move highest_memmap_pfn

Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
__read_mostly in memory.c: to help them share a cacheline, since they're
very often tested together in vm_normal_page().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6f9edd6 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: maintain rolling count of pages to free from the PCP

When round-robin freeing pages from the PCP lists, empty lists may be
encountered. In the event one of the lists has more pages than another,
there may be numerous checks for list_empty() which is undesirable. This
patch maintains a count of pages to free which is incremented when empty
lists are encountered. The intention is that more pages will then be
freed from fuller lists than the empty ones reducing the number of empty
list checks in the free path.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f8dcc21 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: split per-cpu list into one-list-per-migrate-type

The following two patches remove searching in the page allocator fast-path
by maintaining multiple free-lists in the per-cpu structure. At the time
the search was introduced, increasing the per-cpu structures would waste a
lot of memory as per-cpu structures were statically allocated at
compile-time. This is no longer the case.

The patches are as follows. They are based on mmotm-2009-08-27.

Patch 1 adds multiple lists to struct per_cpu_pages, one per
migratetype that can be stored on the PCP lists.

Patch 2 notes that the pcpu drain path check empty lists multiple times. The
patch reduces the number of checks by maintaining a count of free
lists encountered. Lists containing pages will then free multiple
pages in batch

The patches were tested with kernbench, netperf udp/tcp, hackbench and
sysbench. The netperf tests were not bound to any CPU in particular and
were run such that the results should be 99% confidence that the reported
results are within 1% of the estimated mean. sysbench was run with a
postgres background and read-only tests. Similar to netperf, it was run
multiple times so that it's 99% confidence results are within 1%. The
patches were tested on x86, x86-64 and ppc64 as

x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
kernbench - No significant difference, variance well within noise
netperf-udp - 1.34% to 2.28% gain
netperf-tcp - 0.45% to 1.22% gain
hackbench - Small variances, very close to noise
sysbench - Very small gains

x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
kernbench - No significant difference, variance well within noise
netperf-udp - 1.83% to 10.42% gains
netperf-tcp - No conclusive until buffer >= PAGE_SIZE
4096 +15.83%
8192 + 0.34% (not significant)
16384 + 1%
hackbench - Small gains, very close to noise
sysbench - 0.79% to 1.6% gain

ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
kernbench - No significant difference, variance well within noise
netperf-udp - 2-3% gain for almost all buffer sizes tested
netperf-tcp - losses on small buffers, gains on larger buffers
possibly indicates some bad caching effect.
hackbench - No significant difference
sysbench - 2-4% gain

This patch:

Currently the per-cpu page allocator searches the PCP list for pages of
the correct migrate-type to reduce the possibility of pages being
inappropriate placed from a fragmentation perspective. This search is
potentially expensive in a fast-path and undesirable. Splitting the
per-cpu list into multiple lists increases the size of a per-cpu structure
and this was potentially a major problem at the time the search was
introduced. These problem has been mitigated as now only the necessary
number of structures is allocated for the running system.

This patch replaces a list search in the per-cpu allocator with one list
per migrate type. The potential snag with this approach is when bulk
freeing pages. We round-robin free pages based on migrate type which has
little bearing on the cache hotness of the page and potentially checks
empty lists repeatedly in the event the majority of PCP pages are of one
type.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f8629631 21-Sep-2009 Wu Fengguang <fengguang.wu@intel.com>

mm: do batched scans for mem_cgroup

For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
rate by up to 32 times.

For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive list,
the active list will be scanned 4 pages, while the inactive list will be
(over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
the balance between the two lists.

It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned

So the end result may be

- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)

The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.

Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c85f51d 21-Sep-2009 Jan Beulich <JBeulich@novell.com>

mm: also use alloc_large_system_hash() for the PID hash table

This is being done by allowing boot time allocations to specify that they
may want a sub-page sized amount of memory.

Overall this seems more consistent with the other hash table allocations,
and allows making two supposedly mm-only variables really mm-only
(nr_{kernel,all}_pages).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78986a67 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone

After anti-fragmentation was merged, a bug was reported whereby devices
that depended on high-order atomic allocations were failing. The solution
was to preserve a property in the buddy allocator which tended to keep the
minimum number of free pages in the zone at the lower physical addresses
and contiguous. To preserve this property, MIGRATE_RESERVE was introduced
and a number of pageblocks at the start of a zone would be marked
"reserve", the number of which depended on min_free_kbytes.

Anti-fragmentation works by avoiding the mixing of page migratetypes
within the same pageblock. One way of helping this is to increase
min_free_kbytes because it becomes less like that it will be necessary to
place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept
there in large contiguous blocks instead of helping anti-fragmentation as
much as it should. With the page-allocator tracepoint patches applied, it
was found during anti-fragmentation tests that the number of
fragmentation-related events were far higher than expected even with
min_free_kbytes at higher values.

This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
to two. For example, with a sufficient min_free_kbytes, 4MB of memory
will be kept aside on an x86-64 and remain more or less free and
contiguous for the systems uptime. This should be sufficient for devices
depending on high-order atomic allocations while helping fragmentation
control when min_free_kbytes is tuned appropriately. As side-effect of
this patch is that the reserve variable is converted to int as unsigned
long was the wrong type to use when ensuring that only the required number
of reserve blocks are created.

With the patches applied, fragmentation-related events as measured by the
page allocator tracepoints were significantly reduced when running some
fragmentation stress-tests on systems with min_free_kbytes tuned to a
value appropriate for hugepage allocations at runtime. On x86, the events
recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
99.83%.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d3d062a 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

tracing, page-allocator: add trace event for page traffic related to the buddy lists

The page allocation trace event reports that a page was successfully
allocated but it does not specify where it came from. When analysing
performance, it can be important to distinguish between pages coming from
the per-cpu allocator and pages coming from the buddy lists as the latter
requires the zone lock to the taken and more data structures to be
examined.

This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between being called to
refill the per-cpu lists or whether it is a high-order allocation.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.

This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to
enter this path.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e0fff1bd 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypes

Fragmentation avoidance depends on being able to use free pages from lists
of the appropriate migrate type. In the event this is not possible,
__rmqueue_fallback() selects a different list and in some circumstances
change the migratetype of the pageblock. Simplistically, the more times
this event occurs, the more likely that fragmentation will be a problem
later for hugepage allocation at least but there are other considerations
such as the order of page being split to satisfy the allocation.

This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b4f278c 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

tracing, page-allocator: add trace events for page allocation and page freeing

This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs. Using the events, it will be known
what struct page and pfns are being allocated and freed and what the call
site was in many cases.

The page alloc tracepoints be used as an indicator as to whether the
workload was heavily dependant on the page allocator or not. You can make
a guess based on vmstat but you can't get a per-process breakdown.
Depending on the call path, the call_site for page allocation may be
__get_free_pages() instead of a useful callsite. Instead of passing down
a return address similar to slab debugging, the user should enable the
stacktrace and seg-addr options to get a proper stack trace.

The pagevec free tracepoint has a different usecase. It can be used to
get a idea of how many pages are being dumped off the LRU and whether it
is kswapd doing the work or a process doing direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 38a39857 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: remove dead function free_cold_page()

The function free_cold_page() has no callers so delete it.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f66a68f 21-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: change migratetype for all pageblocks within a high-order page during __rmqueue_fallback

When there are no pages of a target migratetype free, the page allocator
selects a high-order block of another migratetype to allocate from. When
the order of the page taken is greater than pageblock_order, all
pageblocks within that high-order page should change migratetype so that
pages are later freed to the correct free-lists.

The current behaviour is that pageblocks change migratetype if the order
being split matches the pageblock_order. When pageblock_order <
MAX_ORDER-1, ownership is not changing correct and pages are being later
freed to the incorrect list and this impacts fragmentation avoidance.

This patch changes all pageblocks within the high-order page being split
to the correct migratetype. Without the patch, allocation success rates
for hugepages under stress were about 59% of physical memory on x86-64.
With the patch applied, this goes up to 65%.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 451ea25d 21-Sep-2009 Johannes Weiner <hannes@cmpxchg.org>

mm: perform non-atomic test-clear of PG_mlocked on free

By the time PG_mlocked is cleared in the page freeing path, nobody else is
looking at our page->flags anymore.

It is thus safe to make the test-and-clear non-atomic and thereby removing
an unnecessary and expensive operation from a hotpath.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 945a1113 21-Sep-2009 Akinobu Mita <akinobu.mita@gmail.com>

mm: add gfp mask checking for __get_free_pages()

__get_free_pages() with __GFP_HIGHMEM is not safe because the return
address cannot represent a highmem page. get_zeroed_page() already has
such a debug checking.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a731286d 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: vmstat: add isolate pages

If the system is running a heavy load of processes then concurrent reclaim
can isolate a large number of pages from the LRU. /proc/vmstat and the
output generated for an OOM do not show how many pages were isolated.

This has been observed during process fork bomb testing (mstctl11 in LTP).

This patch shows the information about isolated pages.

Reproduced via:

-----------------------
% ./hackbench 140 process 1000
=> OOM occur

active_anon:146 inactive_anon:0 isolated_anon:49245
active_file:79 inactive_file:18 isolated_file:113
unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
free:370 slab_reclaimable:309 slab_unreclaimable:5492
mapped:53 shmem:15 pagetables:28140 bounce:0

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b259fbde 21-Sep-2009 David Rientjes <rientjes@google.com>

mm: update alloc_flags after oom killer has been called

It is possible for the oom killer to select current as the task to kill.
When this happens, alloc_flags needs to be updated accordingly to set
ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory
reserves as the result of its thread having TIF_MEMDIE set if the
allocation is not __GFP_NOMEMALLOC.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b02108a 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: oom analysis: add shmem vmstat

Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6a7f572 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output

The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions. However, we do not display the amount of memory
consumed by stacks.

Add code to display the amount of memory used for stacks in /proc/meminfo.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71de1ccb 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: oom analysis: add buffer cache information to show_free_areas()

It is often useful to know the statistics for all pages that are handled
like page cache pages when looking at OOM log output.

Therefore show_free_areas() should also display buffer cache statistics.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a0aa73f 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: oom analysis: add per-zone statistics to show_free_areas()

show_free_areas() displays only a limited amount of zone counters. This
patch includes additional counters in the display to allow easier
debugging. This may be especially useful if an OOM is due to running out
of DMA memory.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3701b033 21-Sep-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: show_free_areas(): display slab pages in two separate fields

If an OOM happens, we really want to know the number of remaining
reclaimable pages. So the reclaimable slab and unreclaimable slab fields
should not be combined for display.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55a4462a 21-Sep-2009 Randy Dunlap <randy.dunlap@oracle.com>

page_alloc: fix kernel-doc warning

Ummark function as having kernel-doc notation, fixing the kernel-doc
warning.

Warning(mm/page_alloc.c:4519): No description found for parameter 'zone'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e7e40d9 21-Sep-2009 Shaohua Li <shaohua.li@intel.com>

memory hotplug: make pages from movable zone always isolatable

Pages on movable zone have two types, MIGRATE_MOVABLE and MIGRATE_RESERVE,
both them can be movable, because only movable memory allocation can get
pages from movable zone. This makes pages in movable zone always be able
to migrate.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yakui Zhao <yakui.zhao@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6fb332fa 21-Sep-2009 Shaohua Li <shaohua.li@intel.com>

memory hotplug: exclude isolated page from pco page alloc

Pages marked as isolated should not be allocated again. If such pages
reside in pcp list, they can be allocated too, so there is a ping-pong
memory offline frees some pages to pcp list and the pages get allocated
and then memory offline frees them again, this loop will happen again and
again.

This should have no impact in normal code path, because in normal code
path, pages in pcp list aren't isolated, and below loop will break in the
first entry.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yakui Zhao <yakui.zhao@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 112067f0 21-Sep-2009 Shaohua Li <shaohua.li@intel.com>

memory hotplug: update zone pcp at memory online

In my test, 128M memory is hot added, but zone's pcp batch is 0, which is
an obvious error. When pages are onlined, zone pcp should be updated
accordingly.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yakui Zhao <yakui.zhao@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 478b81fd 21-Sep-2009 David Rientjes <rientjes@google.com>

mm: remove obsoleted alloc_pages cpuset comment

When a cpuset's nodemask is updated, all attached tasks have their cached
task->mems_allowed updated by a heap instead of requiring an explicit call
to cpuset_update_task_memory_state(), which has since been removed in
58568d2a8215cb6f55caf2332017d7bdff954e1c ("cpuset,mm: update tasks'
mems_allowed in time").

Remove the obsoleted comment from the page allocator.

Cc: Paul Menage <menage@google.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a7684a2 16-Sep-2009 Wu Fengguang <fengguang.wu@intel.com>

HWPOISON: check and isolate corrupted free pages v2

If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>


# dd5d241e 05-Sep-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: always change pageblock ownership when anti-fragmentation is disabled

On low-memory systems, anti-fragmentation gets disabled as fragmentation
cannot be avoided on a sufficiently large boundary to be worthwhile. Once
disabled, there is a period of time when all the pageblocks are marked
MOVABLE and the expectation is that they get marked UNMOVABLE at each call
to __rmqueue_fallback().

However, when MAX_ORDER is large the pageblocks do not change ownership
because the normal criteria are not met. This has the effect of
prematurely breaking up too many large contiguous blocks. This is most
serious on NOMMU systems which depend on high-order allocations to boot.
This patch causes pageblocks to change ownership on every fallback when
anti-fragmentation is disabled. This prevents the large blocks being
prematurely broken up.

This is a fix to commit 49255c619fbd482d704289b5eb2795f8e3b7ff2e [page
allocator: move check for disabled anti-fragmentation out of fastpath] and
the problem affects 2.6.31-rc8.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f9cfb31 18-Aug-2009 Bo Liu <bo-liu@hotmail.com>

mm: build_zonelists(): move clear node_load[] to __build_all_zonelists()

If node_load[] is cleared everytime build_zonelists() is
called,node_load[] will have no help to find the next node that should
appear in the given node's fallback list.

Because of the bug, zonelist's node_order is not calculated as expected.
This bug affects on big machine, which has asynmetric node distance.

[synmetric NUMA's node distance]
0 1 2
0 10 12 12
1 12 10 12
2 12 12 10

[asynmetric NUMA's node distance]
0 1 2
0 10 12 20
1 12 10 14
2 20 14 10

This (my bug) is very old but no one has reported this for a long time.
Maybe because the number of asynmetric NUMA is very small and they use
cpuset for customizing node memory allocation fallback.

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: Bo Liu <bo-liu@hotmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1fc28b70 29-Jul-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: allow too high-order warning messages to be suppressed with __GFP_NOWARN

The page allocator warns once when an order >= MAX_ORDER is specified.
This is to catch callers of the allocator that are always falling back to
their worst-case when it was not expected. However, there are cases where
the caller is behaving correctly but cannot suppress the warning. This
patch allows the warning to be suppressed by the callers by specifying
__GFP_NOWARN.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6583bb64 29-Jul-2009 David Rientjes <rientjes@google.com>

mm: avoid endless looping for oom killed tasks

If a task is oom killed and still cannot find memory when trying with
no watermarks, it's better to fail the allocation attempt than to loop
endlessly. Direct reclaim has already failed and the oom killer will
be a no-op since current has yet to die, so there is no other
alternative for allocations that are not __GFP_NOFAIL.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e084b2d9 29-Jul-2009 Mel Gorman <mel@csn.ul.ie>

page-allocator: preserve PFN ordering when __GFP_COLD is set

Fix a post-2.6.24 performace regression caused by
3dfa5721f12c3d5a441448086bee156887daa961 ("page-allocator: preserve PFN
ordering when __GFP_COLD is set").

Narayanan reports "The regression is around 15%. There is no disk controller
as our setup is based on Samsung OneNAND used as a memory mapped device on a
OMAP2430 based board."

The page allocator tries to preserve contiguous PFN ordering when returning
pages such that repeated callers to the allocator have a strong chance of
getting physically contiguous pages, particularly when external fragmentation
is low. However, of the bulk of the allocations have __GFP_COLD set as they
are due to aio_read() for example, then the PFNs are in reverse PFN order.
This can cause performance degration when used with IO controllers that could
have merged the requests.

This patch attempts to preserve the contiguous ordering of PFNs for users of
__GFP_COLD.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Narayananu Gopalakrishnan <narayanan.g@samsung.com>
Tested-by: Narayanan Gopalakrishnan <narayanan.g@samsung.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8aa7e847 09-Jul-2009 Jens Axboe <jens.axboe@oracle.com>

Fix congestion_wait() sync/async vs read/write confusion

Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 264ef8a9 07-Jul-2009 Catalin Marinas <catalin.marinas@arm.com>

kmemleak: Remove alloc_bootmem annotations introduced in the past

kmemleak_alloc() calls were added in some places where alloc_bootmem was
called. Since now kmemleak tracks bootmem allocations, these explicit
calls should be run.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>


# 5bfd7560 05-Jul-2009 Kevin Cernekee <cernekee@gmail.com>

Fix virt_to_phys() warnings

These warnings were observed on MIPS32 using 2.6.31-rc1 and gcc-4.2.0:

mm/page_alloc.c: In function 'alloc_pages_exact':
mm/page_alloc.c:1986: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast

drivers/usb/mon/mon_bin.c: In function 'mon_alloc_buff':
drivers/usb/mon/mon_bin.c:1264: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast

[akpm@linux-foundation.org: fix kernel/perf_counter.c too]
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 66918dcd 30-Jun-2009 Yinghai Lu <yinghai@kernel.org>

x86: only clear node_states for 64bit

Nathan reported that

| commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date: Tue Jun 16 15:33:00 2009 -0700
|
| page-allocator: clear N_HIGH_MEMORY map before we set it again
|
| SRAT tables may contains nodes of very small size. The arch code may
| decide to not activate such a node. However, currently the early boot
| code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
| active although these nodes have no present pages.
|
| For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too

unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
an i386 kvm guest, meaning that cpuset.mems can not be used.

Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn

Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4923abf9 24-Jun-2009 Linus Torvalds <torvalds@linux-foundation.org>

Don't warn about order-1 allocations with __GFP_NOFAIL

Traditionally, we never failed small orders (even regardless of any
__GFP_NOFAIL flags), and slab will allocate order-1 allocations even for
small allocations that could fit in a single page (in order to avoid
excessive fragmentation).

Maybe we should remove this warning entirely, but before making that
judgement, at least limit it to bigger allocations.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 364df0eb 23-Jun-2009 Dimitri Sivanich <sivanich@sgi.com>

mm: fix handling of pagesets for downed cpus

After downing/upping a cpu, an attempt to set
/proc/sys/vm/percpu_pagelist_fraction results in an oops in
percpu_pagelist_fraction_sysctl_handler().

If a processor is downed then we need to set the pageset pointer back to
the boot pageset.

Updates of the high water marks should not access pagesets of unpopulated
zones (those pointer go to the boot pagesets which would be no longer
functional if their size would be increased beyond zero).

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c277331d 19-Jun-2009 Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: clear PG_locked before checking flags on free

da456f1 "page allocator: do not disable interrupts in free_page_mlock()" moved
the PG_mlocked clearing after the flag sanity checking which makes mlocked
pages always trigger 'bad page'. Fix this by clearing the bit up front.

Reported--and-debugged-by: Peter Chubb <peter.chubb@nicta.com.au>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dcce284a 17-Jun-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

mm: Extend gfp masking to the page allocator

The page allocator also needs the masking of gfp flags during boot,
so this moves it out of slab/slub and uses it with the page allocator
as well.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa5e084e 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

vmscan: do not unconditionally treat zones that fail zone_reclaim() as full

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met. The problem is that zone_reclaim() failing at all means the
zone gets marked full.

This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occuping a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.

This patch makes zone_reclaim() return more fine-grained information about
what occured when zone_reclaim() failued. The zone only gets marked full
if it really is unreclaimable. If it's a case that the scan did not occur
or if enough pages were not reclaimed with the limited reclaim_mode, then
the zone is simply skipped.

There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.

This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced. It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option. Due to the age of the bug, it should be
considered a -stable candidate.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73d60b7f 16-Jun-2009 Yinghai Lu <yinghai@kernel.org>

page-allocator: clear N_HIGH_MEMORY map before we set it again

SRAT tables may contains nodes of very small size. The arch code may
decide to not activate such a node. However, currently the early boot
code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
active although these nodes have no present pages.

For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too

Signed-off-by: Yinghai Lu <Yinghai@kernel.org>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82553a93 16-Jun-2009 David Rientjes <rientjes@google.com>

oom: invoke oom killer for __GFP_NOFAIL

The oom killer must be invoked regardless of the order if the allocation
is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free
some memory.

Cc: Nick Piggin <npiggin@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68377659 16-Jun-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: remove CONFIG_UNEVICTABLE_LRU config option

Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
configurability is unnecessary.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bce7394a 16-Jun-2009 Minchan Kim <minchan.kim@gmail.com>

page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens

Solve two problems.

Whenever memory hotplug sucessfully happens, zone->present_pages
have to be changed.

1) Now memory hotplug calls setup_per_zone_wmark_min only when
online_pages called, not offline_pages.

It breaks balance.

2) If zone->present_pages is changed, we also have to change
zone->inactive_ratio. That's because inactive_ratio depends on
zone->present_pages.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96cb4df5 16-Jun-2009 Minchan Kim <minchan.kim@gmail.com>

page-allocator: add inactive ratio calculation function of each zone

Factor the per-zone arithemetic inside setup_per_zone_inactive_ratio()'s
loop into a a separate function, calculate_zone_inactive_ratio(). This
function will be used in a later patch

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc75d33f 16-Jun-2009 Minchan Kim <minchan.kim@gmail.com>

page-allocator: clean up functions related to pages_min

Change the names of two functions. It doesn't affect behavior.

Presently, setup_per_zone_pages_min() changes low, high of zone as well as
min. So a better name is setup_per_zone_wmarks(). That's because Mel
changed zone->pages_[hig/low/min] to zone->watermark array in "page
allocator: replace the watermark-related union in struct zone with a
watermark[] array".

* setup_per_zone_pages_min => setup_per_zone_wmarks

Of course, we have to change init_per_zone_pages_min, too. There are not
pages_min any more.

* init_per_zone_pages_min => init_per_zone_wmark_min

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f33d49a 16-Jun-2009 Rafael J. Wysocki <rjw@rjwysocki.net>

mm, PM/Freezer: Disable OOM killer when tasks are frozen

Currently, the following scenario appears to be possible in theory:

* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* Certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.

Although it is now hard to trigger during hibernation due to the memory
shrinking carried out by the hibernation code, it is theoretically
possible to trigger during suspend after the memory shrinking has been
removed from that code path. Moreover, since memory allocations are
going to be used for the hibernation memory shrinking, it will be even
more likely to happen during hibernation.

To prevent it from happening, introduce the oom_killer_disabled switch
that will cause __alloc_pages_internal() to fail in the situations in
which the OOM killer would have been called and make the freezer set
this switch after tasks have been successfully frozen.

[akpm@linux-foundation.org: be nicer to the namespace]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dab48dab 16-Jun-2009 Andrew Morton <akpm@linux-foundation.org>

page-allocator: warn if __GFP_NOFAIL is used for a large allocation

__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should
detect and suitably handle this (and not by lamely moving the infinite
loop up to the caller level either).

Attempting to use __GFP_NOFAIL for a higher-order allocation is even
worse, so add a once-off runtime check for this to slap people around for
even thinking about trying it.

Cc: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9bb35df 16-Jun-2009 Cyrill Gorcunov <gorcunov@openvz.org>

mm: setup_per_zone_inactive_ratio - fix comment and make it __init

The caller of setup_per_zone_inactive_ratio is an __init function. There
is no need to keep the callee after it completed as well. Also fix a
comment.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5c87eada 16-Jun-2009 Cyrill Gorcunov <gorcunov@openvz.org>

mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not needed

int_sqrt() returns 0 if its argument is zero so call it if only needed.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e08a369 16-Jun-2009 Wu Fengguang <fengguang.wu@intel.com>

vmscan: cleanup the scan batching code

The vmscan batching logic is twisting. Move it into a standalone function
nr_scan_try_batch() and document it. No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20a0307c 16-Jun-2009 Wu Fengguang <fengguang.wu@intel.com>

mm: introduce PageHuge() for testing huge/gigantic pages

A series of patches to enhance the /proc/pagemap interface and to add a
userspace executable which can be used to present the pagemap data.

Export 10 more flags to end users (and more for kernel developers):

11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_HUGE hugeTLB pages
18. KPF_UNEVICTABLE page is in the unevictable LRU list
19. KPF_HWPOISON hardware detected corruption
20. KPF_NOPAGE (pseudo flag) no page frame at the address

(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.

a simple demo of the page-types tool

# ./page-types -h
page-types [options]
-r|--raw Raw mode, for kernel developers
-a|--addr addr-spec Walk a range of pages
-b|--bits bits-spec Walk pages with specified bits
-l|--list Show page details in ranges
-L|--list-each Show page details one by one
-N|--no-summary Don't show summay info
-h|--help Show this usage message
addr-spec:
N one page at offset N (unit: pages)
N+M pages range from N to N+M-1
N,M pages range from N to M-1
N, pages range from N to end
,M pages range from 0 to M
bits-spec:
bit1,bit2 (flags & (bit1|bit2)) != 0
bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
bit1,~bit2 (flags & (bit1|bit2)) == bit1
=bit1,bit2 flags == (bit1|bit2)
bit-names:
locked error referenced uptodate
dirty lru active slab
writeback reclaim buddy mmap
anonymous swapcache swapbacked compound_head
compound_tail huge unevictable hwpoison
nopage reserved(r) mlocked(r) mappedtodisk(r)
private(r) private_2(r) owner_private(r) arch(r)
uncached(r) readahead(o) slob_free(o) slub_frozen(o)
slub_debug(o)
(r) raw mode bits (o) overloaded bits

# ./page-types
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 487369 1903 _________________________________
0x0000000000000014 5 0 __R_D____________________________ referenced,dirty
0x0000000000000020 1 0 _____l___________________________ lru
0x0000000000000024 34 0 __R__l___________________________ referenced,lru
0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru
0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead
0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru
0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
0x0000000000000040 8344 32 ______A__________________________ active
0x0000000000000060 1 0 _____lA__________________________ lru,active
0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active
0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active
0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400 503 1 __________B______________________ buddy
0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000 492 1 ____________a____________________ anonymous
0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007

# ./page-types -r
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 468002 1828 _________________________________
0x0000000100000000 19102 74 _____________________r___________ reserved
0x0000000000008000 41 0 _______________H_________________ compound_head
0x0000000000010000 188 0 ________________T________________ compound_tail
0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head
0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail
0x0000000000000020 1 0 _____l___________________________ lru
0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private
0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru
0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead
0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk
0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead
0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru
0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk
0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private
0x0000000800000040 8124 31 ______A_________________P________ active,private
0x0000000000000040 219 0 ______A__________________________ active
0x0000000800000060 1 0 _____lA_________________P________ lru,active,private
0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active
0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk
0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private
0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active
0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk
0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private
0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private
0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private
0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400 538 2 __________B______________________ buddy
0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000 492 1 ____________a____________________ anonymous
0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked
0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked
0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007

# ./page-types --raw --list --no-summary --bits reserved
offset count flags
0 15 _____________________r___________
31 4 _____________________r___________
159 97 _____________________r___________
4096 2067 _____________________r___________
6752 2390 _____________________r___________
9355 3 _____________________r___________
9728 14526 _____________________r___________

This patch:

Introduce PageHuge(), which identifies huge/gigantic pages by their
dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and make
__free_pages_ok() non-static.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1dd268c 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated logic

alloc_large_system_hash() has logic for freeing pages at the end of an
excessively large power-of-two buffer that is a duplicate of what is in
alloc_pages_exact(). This patch converts alloc_large_system_hash() to use
alloc_pages_exact().

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72807a74 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: sanity check order in the page allocator slow path

Callers may speculatively call different allocators in order of preference
trying to allocate a buffer of a given size. The order needed to allocate
this may be larger than what the page allocator can normally handle.
While the allocator mostly does the right thing, it should not direct
reclaim or wakeup kswapd with a bogus order. This patch sanity checks the
order in the slow path and returns NULL if it is too large.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 092cead6 16-Jun-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

page allocator: move free_page_mlock() to page_alloc.c

Currently, free_page_mlock() is only called from page_alloc.c. Thus, we
can move it to page_alloc.c.

Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 62bc62a8 16-Jun-2009 Christoph Lameter <cl@linux-foundation.org>

page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered
based on cpusets or the zonelist cache. This is actually a heavy function
and touches a number of cache lines.

This patch stores the number of online nodes at boot time and updates the
value when nodes get onlined and offlined. The value is then used in a
number of important paths in place of num_online_nodes().

[rientjes@google.com: do not override definition of node_set_online() with macro]
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 974709bd 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: get the pageblock migratetype without disabling interrupts

Local interrupts are disabled when freeing pages to the PCP list. Part of
that free checks what the migratetype of the pageblock the page is in but
it checks this with interrupts disabled and interupts should never be
disabled longer than necessary. This patch checks the pagetype with
interrupts enabled with the impact that it is possible a page is freed to
the wrong list when a pageblock changes type. As that block is now
already considered mixed from an anti-fragmentation perspective, it's not
of vital importance.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2260e6b 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: update NR_FREE_PAGES only as necessary

When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES
counter must be updated. In the case of bulk per-cpu page freeing, it's
updated once per page. This retouches cache lines more than necessary.
Update the counters one per per-cpu bulk free.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41858966 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: use allocation flags as an index to the zone watermark

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine
which watermark to use.

This patch uses the flags as an array index into a watermark array that is
indexed with WMARK_* defines accessed via helpers. All call sites that
use zone->pages_* are updated to use the helpers for accessing the values
and the array offsets for setting.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3af9c38 16-Jun-2009 Nick Piggin <nickpiggin@yahoo.com.au>

page allocator: do not check for compound pages during the page allocator sanity checks

A number of sanity checks are made on each page allocation and free
including that the page count is zero. page_count() checks for compound
pages and checks the count of the head page if true. However, in these
paths, we do not care if the page is compound or not as the count of each
tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path.
It converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also
avoids checking for compound pages.

[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d395b734 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: do not setup zonelist cache when there is only one node

There is a zonelist cache which is used to track zones that are not in the
allowed cpuset or found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost for
no gain. This patch only uses the zonelist cache when there are NUMA
nodes.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da456f14 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: do not disable interrupts in free_page_mlock()

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If set, it disables interrupts to update counters and
this happens on every page free even though interrupts are disabled very
shortly afterwards a second time. This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still
made. However, the update of counters is delayed until the interrupts are
disabled and the non-lock version for clearing the bit is used. One
potential weirdness with this split is that the counters do not get
updated if the bad_page() check is triggered but a system showing bad
pages is getting screwed already.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed0ae21d 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: do not call get_pageblock_migratetype() more than necessary

get_pageblock_migratetype() is potentially called twice for every page
free. Once, when being freed to the pcp lists and once when being freed
back to buddy. When freeing from the pcp lists, it is known what the
pageblock type was at the time of free so use it rather than rechecking.
In low memory situations under memory pressure, this might skew
anti-fragmentation slightly but the interference is minimal and decisions
that are fragmenting memory are being made anyway.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ac3a409 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: inline __rmqueue_fallback()

__rmqueue_fallback() is in the slow path but has only one call site.
Because there is only one call-site, this function can then be inlined
without causing text bloat. On an x86-based config, it made no difference
as the savings were padded out by NOP instructions. Milage varies but
text will either decrease in size or remain static.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a15c3e9 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: inline buffered_rmqueue()

buffered_rmqueue() is in the fast path so inline it. Because it only has
one call site, this function can then be inlined without causing text
bloat. On an x86-based config, it made no difference as the savings were
padded out by NOP instructions. Milage varies but text will either
decrease in size or remain static.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 728ec980 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: inline __rmqueue_smallest()

Inline __rmqueue_smallest by altering flow very slightly so that there is
only one call site. Because there is only one call-site, this function
can then be inlined without causing text bloat. On an x86-based config,
this patch reduces text by 16 bytes.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a56f57ff 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: remove a branch by assuming __GFP_HIGH == ALLOC_HIGH

Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
flags are equal to each other, we can eliminate a branch.

[akpm@linux-foundation.org: Suggested the hack]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 341ce06f 16-Jun-2009 Peter Zijlstra <a.p.zijlstra@chello.nl>

page allocator: calculate the alloc_flags for allocation only once

Factor out the mapping between GFP and alloc_flags only once. Once
factored out, it only needs to be calculated once but some care must be
taken.

[neilb@suse.de says]
As the test:

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+ if (alloc_flags & ALLOC_NO_WATERMARKS) {

Without care, this would allow recursion into the allocator via direct
reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set but
TF_MEMDIE callers are now allowed to directly reclaim where they would
have been prevented in the past.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3dd28266 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: calculate the migratetype for allocation only once

GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, it is happening multiple times per allocation,
at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5117f45d 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: calculate the preferred zone for allocation only once

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable zone
in the zonelist but the zone depends on the GFP flags specified at the
beginning of the allocation call. This patch calculates preferred_zone
once. It's safe to do this because if preferred_zone is NULL at the start
of the call, no amount of direct reclaim or other actions will change the
fact the allocation will fail.

[akpm@linux-foundation.org: remove (void) casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49255c61 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: move check for disabled anti-fragmentation out of fastpath

On low-memory systems, anti-fragmentation gets disabled as there is
nothing it can do and it would just incur overhead shuffling pages between
lists constantly. Currently the check is made in the free page fast path
for every page. This patch moves it to a slow path. On machines with low
memory, there will be small amount of additional overhead as pages get
shuffled between lists but it should quickly settle.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11e33f6a 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: break up the allocator entry point into fast and slow paths

The core of the page allocator is one giant function which allocates
memory on the stack and makes calculations that may not be needed for
every allocation. This patch breaks up the allocator path into fast and
slow paths for clarity. Note the slow paths are still inlined but the
entry is marked unlikely. If they were not inlined, it actally increases
text size to generate the as there is only one call site.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f82af97 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: check only once if the zonelist is suitable for the allocation

It is possible with __GFP_THISNODE that no zones are suitable. This patch
makes sure the check is only made once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b3c466ce 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: do not sanity check order in the fast path

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.

[akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d239171e 16-Jun-2009 Mel Gorman <mel@csn.ul.ie>

page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask()

The start of a large patch series to clean up and optimise the page
allocator.

The performance improvements are in a wide range depending on the exact
machine but the results I've seen so fair are approximately;

kernbench: 0 to 0.12% (elapsed time)
0.49% to 3.20% (sys time)
aim9: -4% to 30% (for page_test and brk_test)
tbench: -1% to 4%
hackbench: -2.5% to 3.45% (mostly within the noise though)
netperf-udp -1.34% to 4.06% (varies between machines a bit)
netperf-tcp -0.44% to 5.22% (varies between machines a bit)

I haven't sysbench figures at hand, but previously they were within the
-0.5% to 2% range.

On netperf, the client and server were bound to opposite number CPUs to
maximise the problems with cache line bouncing of the struct pages so I
expect different people to report different results for netperf depending
on their exact machine and how they ran the test (different machines, same
cpus client/server, shared cache but two threads client/server, different
socket client/server etc).

I also measured the vmlinux sizes for a single x86-based config with
CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the
.config is based on the Debian Lenny kernel config so I expect it to be
reasonably typical.

This patch:

__alloc_pages_internal is the core page allocator function but essentially
it is an alias of __alloc_pages_nodemask. Naming a publicly available and
exported function "internal" is also a big ugly. This patch renames
__alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
nodemask function.

Warning - This patch renames an exported symbol. No kernel driver is
affected by external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c0db466 16-Jun-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: alloc_large_system_hash check order

On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash()
had better make its own check on the order.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58568d2a 16-Jun-2009 Miao Xie <miaox@cn.fujitsu.com>

cpuset,mm: update tasks' mems_allowed in time

Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.

In order to update tasks' mems_allowed in time, we must modify the code of
memory policy. Because the memory policy is applied in the process's
context originally. After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.

But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.

[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().

Remove the now unneeded 'nodes = NULL' from mpol_new().

Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:

I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().

This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".

Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b1eeab67 25-Nov-2008 Vegard Nossum <vegard.nossum@gmail.com>

kmemcheck: add hooks for the page allocator

This adds support for tracking the initializedness of memory that
was allocated with the page allocator. Highmem requests are not
tracked.

Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

[build fix for !CONFIG_KMEMCHECK]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>


# dbb1f81c 11-Jun-2009 Catalin Marinas <catalin.marinas@arm.com>

kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash

The alloc_large_system_hash function is called from various places in
the kernel and it contains pointers to other allocated structures. It
therefore needs to be traced by kmemleak.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


# 888a589f 15-May-2009 Yinghai Lu <yinghai@kernel.org>

mm, x86: remove MEMORY_HOTPLUG_RESERVE related code

after:

| commit b263295dbffd33b0fbff670720fa178c30e3392a
| Author: Christoph Lameter <clameter@sgi.com>
| Date: Wed Jan 30 13:30:47 2008 +0100
|
| x86: 64-bit, make sparsemem vmemmap the only memory model

we don't have MEMORY_HOTPLUG_RESERVE anymore.

Historically, x86-64 had an architecture-specific method for memory hotplug
whereby it scanned the SRAT for physical memory ranges that could be
potentially used for memory hot-add later. By reserving those ranges
without physical memory, the memmap would be allocated and left dormant
until needed. This depended on the DISCONTIG memory model which has been
removed so the code implementing HOTPLUG_RESERVE is now dead.

This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE.

(Changelog authored by Mel.)

v2: updated changelog, and remove hotadd= in doc

[ Impact: remove dead code ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Workflow-found-OK-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A0C4910.7090508@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 3a6be87f 06-May-2009 David Howells <dhowells@redhat.com>

nommu: clamp zone_batchsize() to 0 under NOMMU conditions

Clamp zone_batchsize() to 0 under NOMMU conditions to stop
free_hot_cold_page() from queueing and batching frees.

The problem is that under NOMMU conditions it is really important to be
able to allocate large contiguous chunks of memory, but when munmap() or
exit_mmap() releases big stretches of memory, return of these to the buddy
allocator can be deferred, and when it does finally happen, it can be in
small chunks.

Whilst the fragmentation this incurs isn't so much of a problem under MMU
conditions as userspace VM is glued together from individual pages with
the aid of the MMU, it is a real problem if there isn't an MMU.

By clamping the page freeing queue size to 0, pages are returned to the
allocator immediately, and the buddy detector is more likely to be able to
glue them together into large chunks immediately, and fragmentation is
less likely to occur.

By disabling batching of frees, and by turning off the trimming of excess
space during boot, Coldfire can manage to boot.

Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9155203a 06-May-2009 David Howells <dhowells@redhat.com>

mm: use roundown_pow_of_two() in zone_batchsize()

Use roundown_pow_of_two(N) in zone_batchsize() rather than (1 <<
(fls(N)-1)) as they are equivalent, and with the former it is easier to
see what is going on.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 327c0e96 31-Mar-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

vmscan: fix it to take care of nodemask

try_to_free_pages() is used for the direct reclaim of up to
SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
be used but this is not passed to try_to_free_pages(). This can lead to
unnecessary reclaim of pages that are unusable by the caller and int the
worst case lead to allocation failure as progress was not been make where
it is needed.

This patch passes the nodemask used for alloc_pages_nodemask() to
try_to_free_pages().

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee99c71c 31-Mar-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: introduce for_each_populated_zone() macro

Impact: cleanup

In almost cases, for_each_zone() is used with populated_zone(). It's
because almost function doesn't need memoryless node information.
Therefore, for_each_populated_zone() can help to make code simplify.

This patch has no functional change.

[akpm@linux-foundation.org: small cleanup]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e713a21d 09-Jan-2009 Alexey Zaytsev <alexey.zaytsev@gmail.com>

trivial: Fix dubious bitwise 'or' usage spotted by sparse.

It doesn't change the semantics, but it looks like
the logical 'or' was meant to be used here.

Signed-off-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# a70f7302 12-Mar-2009 Rusty Russell <rusty@rustcorp.com.au>

cpumask: replace node_to_cpumask with cpumask_of_node.

Impact: cleanup

node_to_cpumask (and the blecherous node_to_cpumask_ptr which
contained a declaration) are replaced now everyone implements
cpumask_of_node.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# cc2559bc 18-Feb-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mm: fix memmap init for handling memory hole

Now, early_pfn_in_nid(PFN, NID) may returns false if PFN is a hole.
and memmap initialization was not done. This was a trouble for
sparc boot.

To fix this, the PFN should be initialized and marked as PG_reserved.
This patch changes early_pfn_in_nid() return true if PFN is a hole.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2dbcfa7 18-Feb-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mm: clean up for early_pfn_to_nid()

What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:

BUG_ON(page_zone(start_page) != page_zone(end_page));

Once I knew this is what was happening, I added some annotations:

if (unlikely(page_zone(start_page) != page_zone(end_page))) {
printk(KERN_ERR "move_freepages: Bogus zones: "
"start_page[%p] end_page[%p] zone[%p]\n",
start_page, end_page, zone);
printk(KERN_ERR "move_freepages: "
"start_zone[%p] end_zone[%p]\n",
page_zone(start_page), page_zone(end_page));
printk(KERN_ERR "move_freepages: "
"start_pfn[0x%lx] end_pfn[0x%lx]\n",
page_to_pfn(start_page), page_to_pfn(end_page));
printk(KERN_ERR "move_freepages: "
"start_nid[%d] end_nid[%d]\n",
page_to_nid(start_page), page_to_nid(end_page));
...

And here's what I got:

move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
move_freepages: start_nid[1] end_nid[0]

My memory layout on this box is:

[ 0.000000] Zone PFN ranges:
[ 0.000000] Normal 0x00000000 -> 0x0081ff5d
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[8] active PFN ranges
[ 0.000000] 0: 0x00000000 -> 0x00020000
[ 0.000000] 1: 0x00800000 -> 0x0081f7ff
[ 0.000000] 1: 0x0081f800 -> 0x0081fe50
[ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
[ 0.000000] 1: 0x0081feda -> 0x0081fedb
[ 0.000000] 1: 0x0081fedd -> 0x0081fee5
[ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
[ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d

So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.

This patch:

Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
I think it makes fix-for-memmap-init not easy.

This patch moves all declaration to include/linux/mm.h

After this,
if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use static definition in include/linux/mm.h
else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use generic definition in mm/page_alloc.c
else
-> per-arch back end function will be called.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf40bd16 21-Jan-2009 Nick Piggin <npiggin@suse.de>

lockdep: annotate reclaim context (__GFP_NOFS)

Here is another version, with the incremental patch rolled up, and
added reclaim context annotation to kswapd, and allocation tracing
to slab allocators (which may only ever reach the page allocator
in rare cases, so it is good to put annotations here too).

Haven't tested this version as such, but it should be getting closer
to merge worthy ;)

--
After noticing some code in mm/filemap.c accidentally perform a __GFP_FS
allocation when it should not have been, I thought it might be a good idea to
try to catch this kind of thing with lockdep.

I coded up a little idea that seems to work. Unfortunately the system has to
actually be in __GFP_FS page reclaim, then take the lock, before it will mark
it. But at least that might still be some orders of magnitude more common
(and more debuggable) than an actual deadlock condition, so we have some
improvement I hope (the concept is no less complete than discovery of a lock's
interrupt contexts).

I guess we could even do the same thing with __GFP_IO (normal reclaim), and
even GFP_NOIO locks too... but filesystems will have the most locks and fiddly
code paths, so let's start there and see how it goes.

It *seems* to work. I did a quick test.

=================================
[ INFO: inconsistent lock state ]
2.6.28-rc6-00007-ged31348-dirty #26
---------------------------------
inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage.
modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes:
(testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
{in-reclaim-W} state was registered at:
[<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60
[<ffffffff80268f71>] lock_acquire+0x91/0xc0
[<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310
[<ffffffffa002002b>] brd_init+0x2b/0x216 [brd]
[<ffffffff8020903b>] _stext+0x3b/0x170
[<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
[<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 3929
hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310
hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310
softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0
softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0

other info that might help us debug this:
1 lock held by modprobe/8526:
#0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]

stack backtrace:
Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26
Call Trace:
[<ffffffff80265483>] print_usage_bug+0x193/0x1d0
[<ffffffff80266530>] mark_lock+0xaf0/0xca0
[<ffffffff80266735>] mark_held_locks+0x55/0xc0
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60
[<ffffffff80285005>] __alloc_pages_internal+0x475/0x580
[<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffffa002006a>] brd_init+0x6a/0x216 [brd]
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffff8020903b>] _stext+0x3b/0x170
[<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10
[<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180
[<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190
[<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
[<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 6e901571 07-Jan-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: introduce zone_reclaim struct

Add zone_reclam_stat struct for later enhancement.

A later patch uses this. This patch doesn't any behavior change (yet).

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f572e3f 06-Jan-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: remove CONFIG_OUT_OF_LINE_PFN_TO_PAGE

No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e9e6365 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: KERN_ALERT BUG instead of KERN_EMERG

bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
which I've followed in print_bad_pte(). These are serious system errors,
on a par with BUGs, but they're not quite emergencies, and we do our best
to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

And remove the "Trying to fix it up, but a reboot is needed" line: it's
not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d936cf9b 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: ratelimit print_bad_pte and bad_page

print_bad_pte() and bad_page() might each need ratelimiting - especially
for their dump_stacks, almost never of interest, yet not quite
dispensible. Correlating corruption across neighbouring entries can be
very helpful, so allow a burst of 60 reports before keeping quiet for the
remainder of that minute (or allow a steady drip of one report per
second).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22b31eec 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: vm_normal_page use print_bad_pte

print_bad_pte() is so far being called only when zap_pte_range() finds
negative page_mapcount, or there's a fault on a pte_file where it does not
belong. That's weak coverage when we suspect pagetable corruption.

Originally, it was called when vm_normal_page() found an invalid pfn: but
pfn_valid is expensive on some architectures and configurations, so 2.6.24
put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
2.6.26 replaced it by a VM_BUG_ON (likewise).

Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a
__read_mostly note of the highest_memmap_pfn, vm_normal_page() then check
pfn against that. We could call this pfn_plausible() or pfn_sane(), but I
doubt we'll need it elsewhere: of course it's not reliable, but gives much
stronger pagetable validation on many boxes.

Also use print_bad_pte() when the pte_special bit is found outside a
VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3dc14741 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: replace page_remove_rmap Eeek and BUG

Now that bad pages are kept out of circulation, there is no need for the
infamous page_remove_rmap() BUG() - once that page is freed, its negative
mapcount will issue a "Bad page state" message and the page won't be
freed. Removing the BUG() allows more info, on subsequent pages, to be
gathered.

We do have more info about the page at this point than bad_page() can know
- notably, what the pmd is, which might pinpoint something like low 64kB
corruption - but page_remove_rmap() isn't given the address to find that.

In practice, there is only one call to page_remove_rmap() which has ever
reported anything, that from zap_pte_range() (usually on exit, sometimes
on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
message and leave it all to zap_pte_range().

mm/memory.c already has a hardly used print_bad_pte() function, showing
some of the appropriate info: extend it to show what we want for the rmap
case: pte info, page info (when there is a page) and vma info to compare.
zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
use if it works that out for itself.

Some of this info is also shown in bad_page()'s "Bad page state" message.
Keep them separate, but adjust them to match each other as far as
possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
there too.

print_bad_pte() show current->comm unconditionally (though it should get
repeated in the usually irrelevant stack trace): sorry, I misled Nick
Piggin to make it conditional on vm_mm == current->mm, but current->mm is
already NULL in the exit case. Usually current->comm is good, though
exceptionally it may not be that of the mm (when "swapoff" for example).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8cc3b392 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: keep any bad page out of circulation

Until now the bad_page() checkers have special-cased PageReserved, keeping
those pages out of circulation thereafter. Now extend the special case to
all: we want to keep ANY page with bad state out of circulation - the
"free" page may well be in use by something.

Leave the bad state of those pages untouched, for examination by
debuggers; except for PageBuddy - leaving that set would risk bringing the
page back.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79f4b7bf 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

badpage: simplify page_alloc flag check+clear

Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
a page: check the same flags as before when freeing, clear ALL the flags
(unless PageReserved) when freeing, check ALL flags off when allocating.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# efab8186 06-Jan-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: make setup_per_zone_inactive_ratio() static

Sparse output following warning.

mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?

cleanup here.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b962716b 06-Jan-2009 Hugh Dickins <hugh@veritas.com>

mm: optimize get_scan_ratio for no swap

Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc
optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual
declaration to swapfile.c: it never belonged in page_alloc.c.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58a01a45 06-Jan-2009 Julia Lawall <julia@diku.dk>

mm/page_alloc.c: eliminate NULL test and memset after alloc_bootmem

As noted by Akinobu Mita in patch b1fceac2b9e04d278316b2faddf276015fc06e3b,
alloc_bootmem and related functions never return NULL and always return a
zeroed region of memory. Thus a NULL test or memset after calls to these
functions is unnecessary.

This was fixed using the following semantic patch.
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@@
expression E;
statement S;
@@

E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
(
- BUG_ON (E == NULL);
|
- if (E == NULL) S
)

@@
expression E,E1;
@@

E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
- memset(E,0,E1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5594c8c8 06-Jan-2009 Yinghai Lu <yinghai@kernel.org>

mm: print out memmap number only if it is not zero

Don't print the size of the zone's memmap array if it does not have one.

Impact: cleanup

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e33c3b5e 12-Nov-2008 David Rientjes <rientjes@google.com>

cpusets: update mems allowed in page allocator

If all allowable memory is unreclaimable, it is possible to loop forever
in the page allocator for ~__GFP_NORETRY allocations.

During this time, it is also possible for a task's cpuset to expand its
set of allowable nodes so that it now includes free memory. The cached
copy of this set, current->mems_allowed, is stale, however, since there
has not been a subsequent call to cpuset_update_task_memory_state().

The cached copy of the set of allowable nodes is now updated in the page
allocator's slow path so the additional memory is available to
get_page_from_freelist().

[akpm@linux-foundation.org: add comment]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 18229df5 06-Nov-2008 Andy Whitcroft <apw@shadowen.org>

hugetlb: pull gigantic page initialisation out of the default path

As we can determine exactly when a gigantic page is in use we can optimise
the common regular page cases by pulling out gigantic page initialisation
into its own function. As gigantic pages are never released to buddy we
do not need a destructor. This effectivly reverts the previous change to
the main buddy allocator. It also adds a paranoid check to ensure we
never release gigantic pages from hugetlbfs to the main buddy.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org> [2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52d4b9ac 18-Oct-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: allocate all page_cgroup at boot

Allocate all page_cgroup at boot and remove page_cgroup poitner from
struct page. This patch adds an interface as

struct page_cgroup *lookup_page_cgroup(struct page*)

All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.

Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)

On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.

By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.

I added printk message as

"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"

maybe enough informative for users.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1125b4e3 18-Oct-2008 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lock

This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
There seems to be no need for the lru_lock anymore, but there is a need for
zone->lock instead, because that function may call move_freepages() via
setup_zone_migrate_reserve().

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d903ef9f 18-Oct-2008 Yinghai Lu <yhlu.kernel@gmail.com>

mm: print out meminit for memmap

Improve debuggability of memory setup problems.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 985737cf 18-Oct-2008 Lee Schermerhorn <lee.schermerhorn@hp.com>

mlock: count attempts to free mlocked page

Allow free of mlock()ed pages. This shouldn't happen, but during
developement, it occasionally did.

This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b291f000 18-Oct-2008 Nick Piggin <npiggin@suse.de>

mlock: mlocked pages are unevictable

Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.

Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>

splitlru: introduce __get_user_pages():

New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b854121 18-Oct-2008 Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Unevictable LRU Page Statistics

Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 556adecb 18-Oct-2008 Rik van Riel <riel@redhat.com>

vmscan: second chance replacement for anonymous pages

We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages. At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by not
taking the referenced state into account when deactivating an anonymous
page. After all, every anonymous page starts out referenced, so why
check?

If an anonymous page gets referenced again before it reaches the end of
the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).

Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.

[kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
[kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f98a2fe 18-Oct-2008 Rik van Riel <riel@redhat.com>

vmscan: split LRU lists into anon & file sets

Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2e18538 18-Oct-2008 Rik van Riel <riel@redhat.com>

define page_file_cache() function

Define page_file_cache() function to answer the question:
is page backed by a file?

Originally part of Rik van Riel's split-lru patch. Extracted to make
available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will be needed by
subsequent "split LRU" and "noreclaim" patches.

Unfortunately this needs to use a page flag, since the PG_swapbacked state
needs to be preserved all the way to the point where the page is last
removed from the LRU. Trying to derive the status from other info in the
page resulted in wrong VM statistics in earlier split VM patchsets.

The total number of page flags in use on a 32 bit machine after this patch
is 19.

[akpm@linux-foundation.org: fix up out-of-order merge fallout]
[hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b69408e8 18-Oct-2008 Christoph Lameter <cl@linux-foundation.org>

vmscan: Use an indexed array for LRU variables

Currently we are defining explicit variables for the inactive and active
list. An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

text data bss dec hex filename
4097753 573120 4092484 8763357 85b7dd vmlinux

After:

text data bss dec hex filename
4097729 573120 4092484 8763333 85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on the
reclaim code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# db99100d 15-Oct-2008 Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c:free_area_init_nodes() fix inappropriate use of enum

Local variable `i' is a) misleadingly-named for an `enum zone_type' and b)
used for indexing zones as well as nodes as well as node_maps.

Make it an `int'.

Reported-by: Frans Pop <elendil@planet.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6babc32c 02-Oct-2008 Andy Whitcroft <apw@shadowen.org>

mm: handle initialising compound pages at orders greater than MAX_ORDER

When we initialise a compound page we initialise the page flags and head
page pointer for all base pages spanned by that page. When we initialise
a gigantic page (a page of order greater than or equal to MAX_ORDER) we
have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we
assume that all elements of the mem_map in this page are contigious in
memory. However this is only guarenteed out to MAX_ORDER_NR_PAGES pages,
and with SPARSEMEM enabled they will not be contigious. This leads us to
walk off the end of the first section and scribble on everything which
follows, BAD.

When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next
section of the mem_map. As gigantic pages can only be maximally aligned
we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from
the start of the page.

This is a bug fix for the gigantic page support in hugetlbfs.

Credit to Mel Gorman for spotting the issue.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52765583 02-Sep-2008 Marcin Slusarz <marcin.slusarz@gmail.com>

mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data

WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
The variable contig_page_data references
the variable __initdata bootmem_node_data
If the reference is valid then annotate the
variable with __init* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 344c790e 02-Sep-2008 Adam Litke <agl@us.ibm.com>

mm: make setup_zone_migrate_reserve() aware of overlapping nodes

I have gotten to the root cause of the hugetlb badness I reported back on
August 15th. My system has the following memory topology (note the
overlapping node):

Node 0 Memory: 0x8000000-0x44000000
Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000. When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone. Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74768ed8 12-Aug-2008 Jan Beulich <jbeulich@novell.com>

page allocator: use no-panic variant of alloc_bootmem() in alloc_large_system_hash()

.. since a failed allocation is being (initially) handled gracefully, and
panic()-ed upon failure explicitly in the function if retries with smaller
sizes failed.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d1958f0 29-Jul-2008 Yinghai Lu <yhlu.kernel@gmail.com>

mm: remove find_max_pfn_with_active_regions

It has no user now

Also print out info about adding/removing active regions.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9b1a4d38 27-Jul-2008 Rusty Russell <rusty@rustcorp.com.au>

stop_machine: Wean existing callers off stop_machine_run()

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# af370fb8 23-Jul-2008 Yasunori Goto <y-goto@jp.fujitsu.com>

memory hotplug: small fixes to bootmem freeing for memory hotremove

- Change some naming
* Magic -> types
* MIX_INFO -> MIX_SECTION_INFO
* Change definition of bootmem type from direct hex value

- __free_pages_bootmem() becomes __meminit.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b69a7288 23-Jul-2008 Adrian Bunk <bunk@kernel.org>

mm/page_alloc.c: cleanups

This patch contains the following cleanups:
- make the following needlessly global variables static:
- required_kernelcore
- zone_movable_pfn[]
- make the following needlessly global functions static:
- move_freepages()
- move_freepages_block()
- setup_pageset()
- find_usable_zone_for_movable()
- adjust_zone_range_for_zone_movable()
- __absent_pages_in_range()
- find_min_pfn_for_node()
- find_zone_movable_pfns_for_nodes()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2be0ffe2 23-Jul-2008 Timur Tabi <timur@freescale.com>

mm: add alloc_pages_exact() and free_pages_exact()

alloc_pages_exact() is similar to alloc_pages(), except that it allocates
the minimum number of pages to fulfill the request. This is useful if you
want to allocate a very large buffer that is slightly larger than an even
power-of-two number of pages. In that case, alloc_pages() will waste a
lot of memory.

I have a video driver that wants to allocate a 5MB buffer. alloc_pages()
wiill waste 3MB of physically-contiguous memory.

Signed-off-by: Timur Tabi <timur@freescale.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01ad1c08 23-Jul-2008 Andi Kleen <ak@linux.intel.com>

mm: export prep_compound_page to mm

hugetlb will need to get compound pages from bootmem to handle the case of
them being greater than or equal to MAX_ORDER. Export the constructor
function needed for this.

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9109fb7b 23-Jul-2008 Johannes Weiner <hannes@saeurebad.de>

mm: drop unneeded pgdat argument from free_area_init_node()

free_area_init_node() gets passed in the node id as well as the node
descriptor. This is redundant as the function can trivially get the node
descriptor itself by means of NODE_DATA() and the node's id.

I checked all the users and NODE_DATA() seems to be usable everywhere
from where this function is called.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c82d0ce 23-Jul-2008 Andy Whitcroft <apw@shadowen.org>

buddy: clarify comments describing buddy merge

In __free_one_page(), the comment "Move the buddy up one level" appears
attached to the break and by implication when the break is taken we are
moving it up one level:

if (!page_is_buddy(page, buddy, order))
break; /* Move the buddy up one level. */

In reality the inverse is true, we break out when we can no longer merge
this page with its buddy. Looking back into pre-history (into the full
git history) it appears that these two lines accidentally got joined as
part of another change.

Move the comment down where it belongs below the if and clarify its
language.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4048e5d 23-Jul-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

page allocator: inline some __alloc_pages() wrappers

Two zonelist patch series rewrote __page_alloc() largely. Now, it is just
a wrapper function. Inlining them will save a function call.

[akpm@linux-foundation.org: export __alloc_pages_internal]
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b61bfa3c 23-Jul-2008 Johannes Weiner <hannes@saeurebad.de>

mm: move bootmem descriptors definition to a single place

There are a lot of places that define either a single bootmem descriptor or an
array of them. Use only one central array with MAX_NUMNODES items instead.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68ad8df4 23-Jul-2008 Mel Gorman <mel@csn.ul.ie>

mm: print out the zonelists on request for manual verification

This patch prints out the zonelists during boot for manual verification by the
user if the mminit_loglevel is MMINIT_VERIFY or higher.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2dbb51c4 23-Jul-2008 Mel Gorman <mel@csn.ul.ie>

mm: make defensive checks around PFN values registered for memory usage

There are a number of different views to how much memory is currently active.
There is the arch-independent zone-sizing view, the bootmem allocator and
memory models view.

Architectures register this information at different times and is not
necessarily in sync particularly with respect to some SPARSEMEM limitations.

This patch introduces mminit_validate_memmodel_limits() which is able to
validate and correct PFN ranges with respect to the memory model. It is only
SPARSEMEM that currently validates itself.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 708614e6 23-Jul-2008 Mel Gorman <mel@csn.ul.ie>

mm: verify the page links and memory model

Print out information on how the page flags are being used if mminit_loglevel
is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the
flags regardless of loglevel.

When the page flags are updated with section, node and zone information, a
check are made to ensure the values can be retrieved correctly. Finally we
confirm that pfn_to_page and page_to_pfn are the correct inverse functions.

[akpm@linux-foundation.org: fix printk warnings]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b74ab97 23-Jul-2008 Mel Gorman <mel@csn.ul.ie>

mm: add a basic debugging framework for memory initialisation

Boot initialisation is very complex, with significant numbers of
architecture-specific routines, hooks and code ordering. While significant
amounts of the initialisation is architecture-independent, it trusts the data
received from the architecture layer. This is a mistake, and has resulted in
a number of difficult-to-diagnose bugs.

This patchset adds some validation and tracing to memory initialisation. It
also introduces a few basic defensive measures. The validation code can be
explicitly disabled for embedded systems.

This patch:

Add additional debugging and verification code for memory initialisation.

Once enabled, the verification checks are always run and when required
additional debugging information may be outputted via a mminit_loglevel=
command-line parameter.

The verification code is placed in a new file mm/mm_init.c. Ideally other mm
initialisation code will be moved here over time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5dab8ec1 25-Jun-2008 Paul Jackson <pj@sgi.com>

mm, generic, x86 boot: more tweaks to hex prints of some pfn addresses

Fix some problems with (and applies on top of) a previous patch:
x86 boot: show pfn addresses in hex not decimal in some kernel info printks

Primarily change "0x%8lx" format, which displays with a right aligned
space filled hex number (spaces between the "0x" prefix and the number),
into "%0#10lx" format, which zero fills instead of space fills, and
which uses the printf flag '#' to request the "0x" prefix instead of
hard coding it.

Also replace some other "0x%lx" formats with "%#lx", making use of the
'#' printf flag again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 2bc0d261 22-Jun-2008 Paul Jackson <pj@sgi.com>

x86 boot: more consistently use type int for node ids

Everywhere I look, node id's are of type 'int', except in this one
case, which has 'unsigned long'. Change this one to 'int' as well.
There is nothing special about the way this variable 'nid' is used in
this routine to justify using an unusual type here.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e2fc252e 22-Jun-2008 Paul Jackson <pj@sgi.com>

x86 boot: show pfn addresses in hex not decimal in some kernel info printks

Page frame numbers (the portion of physical addresses above the low
order page offsets) are displayed in several kernel debug and info
prints in decimal, not hex. Decimal addresse are unreadable. Use hex.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# d52d53b8 16-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

RFC x86: try to remove arch_get_ram_range

want to remove arch_get_ram_range, and use early_node_map instead.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b5bc6c0e 14-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

x86, mm: use add_highpages_with_active_regions() for high pages init v2

use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage

also remove page_is_reserved_early(), it is not needed anymore.

v2: fix the build of other platforms

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# cc1050ba 13-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

x86: replace shrink_active_range() with remove_active_range()

in case we have kva before ramdisk on a node, we still need to use
those ranges.

v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
those area could be treat as free high pages.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 494de900 02-Jul-2008 Mel Gorman <mel@csn.ul.ie>

Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr

The non-NUMA case of build_zonelist_cache() would initialize the
zlcache_ptr for both node_zonelists[] to NULL.

Which is problematic, since non-NUMA only has a single node_zonelists[]
entry, and trying to zero the non-existent second one just overwrote the
nr_zones field instead.

As kswapd uses this value to determine what reclaim work is necessary,
the result is that kswapd never reclaims. This causes processes to
stall frequently in low-memory situations as they always direct reclaim.
This patch initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Dan Williams <dan.j.williams@intel.com>
[ Simplified patch a bit ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 15c8b6c1 09-May-2008 Jens Axboe <jens.axboe@oracle.com>

on_each_cpu(): kill unused 'retry' parameter

It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# cc1a9d86 08-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

mm, x86: shrink_active_range() should check all

Now we are using register_e820_active_regions() instead of
add_active_range() directly. So end_pfn could be different between the
value in early_node_map to node_end_pfn.

So we need to make shrink_active_range() smarter.

shrink_active_range() is a generic MM function in mm/page_alloc.c but
it is only used on 32-bit x86. Should we move it back to some file in
arch/x86?

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# dfa7e20c 09-Jun-2008 Russ Anderson <rja@sgi.com>

mm: Minor clean-up of page flags in mm/page_alloc.c

Minor source code cleanup of page flags in mm/page_alloc.c.
Move the definition of the groups of bits to page-flags.h.

The purpose of this clean up is that the next patch will
conditionally add a page flag to the groups. Doing that
in a header file is cleaner than adding #ifdefs to the
C code.

Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8c27ac9 01-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

x86, numa, 32-bit: print out debug info on all kvas

also fix the print out of node_remap_end_vaddr

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# cd94b9db 23-May-2008 Heiko Carstens <hca@linux.ibm.com>

memory hotplug: fix early allocation handling

Trying to add memory via add_memory() from within an initcall function
results in

bootmem alloc of 163840 bytes failed!
Kernel panic - not syncing: Out of memory

This is caused by zone_wait_table_init() which uses system_state to decide
if it should use the bootmem allocator or not.

When initcalls are handled the system_state is still SYSTEM_BOOTING but
the bootmem allocator doesn't work anymore. So the allocation will fail.

To fix this use slab_is_available() instead as indicator like we do it
everywhere else.

[akpm@linux-foundation.org: coding-style fix]
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7eb54824 23-May-2008 Andy Whitcroft <apw@shadowen.org>

zonelists: handle a node zonelist with no applicable entries

When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
panics when trying node local allocations:

BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e
*pdpt = 00000000013a7001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:

Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0
EIP is at get_page_from_freelist+0x4a/0x18e
EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
Call Trace:
[<c10426e4>] __alloc_pages_internal+0x99/0x378
[<c10429ca>] __alloc_pages+0x7/0x9
[<c105e0e8>] kmem_getpages+0x66/0xef
[<c105ec55>] cache_grow+0x8f/0x123
[<c105f117>] ____cache_alloc_node+0xb9/0xe4
[<c105f427>] kmem_cache_alloc_node+0x92/0xd2
[<c122118c>] setup_cpu_cache+0xaf/0x177
[<c105e6ca>] kmem_cache_create+0x2c8/0x353
[<c13853af>] kmem_cache_init+0x1ce/0x3ad
[<c13755c5>] start_kernel+0x178/0x1ee

This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
node 0, all other nodes are mapped above 4GB physical. Here is a dump
of the zonelists from this system:

zonelists pgdat=c1400000
0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
1: c14006c0:2 c1400360:1 c1400000:0
zonelists pgdat=f7c00000
0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
1: f7c006c0:2
zonelists pgdat=f7e00000
0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
1: f7e006c0:2

When performing a node local allocation we call get_page_from_freelist()
looking for a page. It in turn calls first_zones_zonelist() which returns
a preferred_zone. Where there are no applicable zones this will be NULL.
However we use this unconditionally, leading to this panic.

Where there are no applicable zones there is no possibility of a successful
allocation, so simply fail the allocation.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7232154 23-May-2008 Johannes Weiner <hannes@saeurebad.de>

mm: don't drop a partial page in a zone's memory map size

In a zone's present pages number, account for all pages occupied by the
memory map, including a partial.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76cdd58e 14-May-2008 Heiko Carstens <hca@linux.ibm.com>

memory_hotplug: always initialize pageblock bitmap

Trying to online a new memory section that was added via memory hotplug
sometimes results in crashes when the new pages are added via __free_page.
Reason for that is that the pageblock bitmap isn't initialized and hence
contains random stuff. That means that get_pageblock_migratetype()
returns also random stuff and therefore

list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);

in __free_one_page() tries to do a list_add to something that isn't even
necessarily a list.

This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix
usemap initialization") which makes sure that the pageblock bitmap gets
only initialized for pages present in a zone. Unfortunately for hot-added
memory the zones "grow" after the memmap and the pageblock memmap have
been initialized. Which means that the new pages have an unitialized
bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span()
are moved to __add_zone() just before the initialization happens.

The patch also moves the two functions since __add_zone() is the only
caller and I didn't want to add a forward declaration.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ac7fe5a 30-Apr-2008 Thomas Gleixner <tglx@linutronix.de>

infrastructure to debug (dynamic) objects

We can see an ever repeating problem pattern with objects of any kind in the
kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where
we have no chance to decode the root cause anymore. One problem spot are
kernel timers, where the detection of the problem often happens in interrupt
context and usually causes the machine to panic.

While working on a timer related bug report I had to hack specialized code
into the timer subsystem to get a reasonable hint for the root cause. This
debug hack was fine for temporary use, but far from a mergeable solution due
to the intrusiveness into the timer code.

The code further lacked the ability to detect and report the root cause
instantly and keep the system operational.

Keeping the system operational is important to get hold of the debug
information without special debugging aids like serial consoles and special
knowledge of the bug reporter.

The problems described above are not restricted to timers, but timers tend to
expose it usually in a full system crash. Other objects are less explosive,
but the symptoms caused by such mistakes can be even harder to debug.

Instead of creating specialized debugging code for the timer subsystem a
generic infrastructure is created which allows developers to verify their code
and provides an easy to enable debug facility for users in case of trouble.

The debugobjects core code keeps track of operations on static and dynamic
objects by inserting them into a hashed list and sanity checking them on
object operations and provides additional checks whenever kernel memory is
freed.

The tracked object operations are:
- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed and the
subsystem specific code can provide a fixup function which allows to prevent
the damage of the operation. When the sanity check triggers a warning message
and a stack trace is printed.

The list of operations can be extended if the need arises. For now it's
limited to the requirements of the first user (timers).

The core code enqueues the objects into hash buckets. The hash index is
generated from the address of the object to simplify the lookup for the check
on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a
global lock.

The debug code can be compiled in without being active. The runtime overhead
is minimal and could be optimized by asm alternatives. A kernel command line
option enables the debugging code.

Thanks to Ingo Molnar for review, suggestions and cleanup patches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a41f24ea 29-Apr-2008 Nishanth Aravamudan <nacc@us.ibm.com>

page allocator: smarter retry of costly-order allocations

Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim does no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may continue to reclaim small orders of pages all over, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be requred. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).

This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
allocations, we will try up to some point (at least 1<<order reclaimed
pages), rather than forever (which is the case for allocations <=
PAGE_ALLOC_COSTLY_ORDER).

This change improves the /proc/sys/vm/nr_hugepages interface with a
follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
than administrators repeatedly echo'ing a particular value into the
sysctl, and forcing reclaim into action manually, this change allows for
the sysctl to attempt a reasonable effort itself. Similarly, dynamic
pool growth should be more successful under load, as lumpy reclaim can
try to free up pages, rather than failing right away.

Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to every succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab857d093 29-Apr-2008 Nishanth Aravamudan <nacc@us.ibm.com>

mm: fix misleading __GFP_REPEAT related comments

The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
core VM have somewhat differing comments as to their actual semantics.
Annoyingly, the flags definition has inline and header comments, which might
be interpreted as not being equivalent. Just add references to the header
comments in the inline ones so they don't go out of sync in the future. In
their use in __alloc_pages() clarify that the current implementation treats
low-order allocations and __GFP_REPEAT allocations as distinct cases.

To clarify, the flags' semantics are:

__GFP_NORETRY means try no harder than one run through __alloc_pages

__GFP_REPEAT means __GFP_NOFAIL

__GFP_NOFAIL means repeat forever

order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86051ca5 29-Apr-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mm: fix usemap initialization

usemap must be initialized only when pfn is within zone. If not, it corrupts
memory.

And this patch also reduces the number of calls to set_pageblock_migratetype()
from
(pfn & (pageblock_nr_pages -1)
to
!(pfn & (pageblock_nr_pages-1)
it should be called once per pageblock.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2309f9e6 28-Apr-2008 Pavel Machek <pavel@ucw.cz>

mm/page_alloc.c: remove hand-coded get_order()

Remove hand-coded get_order() from page_alloc.c.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0c0a4a51 28-Apr-2008 Yasunori Goto <y-goto@jp.fujitsu.com>

memory hotplug: free memmaps allocated by bootmem

This patch is to free memmaps which is allocated by bootmem.

Freeing usemap is not necessary. The pages of usemap may be necessary for
other sections.

If removing section is last section on the node, its section is the final user
of usemap page. (usemaps are allocated on its section by previous patch.) But
it shouldn't be freed too, because the section must be logical offline state
which all pages are isolated against page allocater. If it is freed, page
alloctor may use it which will be removed physically soon. It will be
disaster. So, this patch keeps it as it is.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a128b2b 28-Apr-2008 Christoph Lameter <clameter@sgi.com>

pageflags: eliminate PG_xxx aliases

Remove aliases of PG_xxx. We can easily drop those now and alias by
specifying the PG_xxx flag in the macro that generates the functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f05111f5 28-Apr-2008 S.Çağlar Onur <caglar@pardus.org.tr>

mm/page_alloc.c: fix indentation

zlc_setup(): handle jiffies wraparound
(10ed273f5016c582413dfbc468dd084957d847e1) changes tab with spaces

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19770b32 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: filter based on a nodemask as well as a gfp_mask

The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist.

A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist. I.e., pages will be allocated from the closest allowed node with
available memory.

[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd1a239f 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: have zonelist contains structs with both a zone pointer and zone_idx

Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a substraction operation. As
the zone_idx is often required, it should be quickly accessible. The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index. The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54a6eb5c 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: use two zonelist that are filtered by GFP mask

Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations. Based on the zones
allowed by a gfp mask, one of these zonelists is selected. All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists. The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages. The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 18ea7e71 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: remember what the preferred zone is for zone_statistics

On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign. It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.

This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it. This simplifies the reading of later patches in
this set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e88460d 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: introduce node_zonelist() for accessing the zonelist for a GFP mask

Introduce a node_zonelist() helper function. It is used to lookup the
appropriate zonelist given a node and a GFP mask. The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
necessary, it can be merged with the next patch in this set without problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dac1d27b 28-Apr-2008 Mel Gorman <mel@csn.ul.ie>

mm: use zonelists instead of zones when direct reclaiming pages

The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags. The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up
the appropriate zonelist for a GFP mask which simplifies patches later in the
set.

The third patch defines/remembers the "preferred zone" for numa statistics, as
it is no longer always the first zone in a zonelist.

The forth patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask.
Two zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration. In real
workloads the gain/loss will depend on how much the userspace portion of the
benchmark benefits from having more cache available due to reduced referencing
of zonelists.

These are the range of performance losses/gains when running against
2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and
ppc64 both NUMA and non-NUMA.
loss to gain
Total CPU time on Kernbench: -0.86% to 1.13%
Elapsed time on Kernbench: -0.79% to 0.76%
page_test from aim9: -4.37% to 0.79%
brk_test from aim9: -0.71% to 4.07%
fork_test from aim9: -1.84% to 4.60%
exec_test from aim9: -0.71% to 1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5f59f08 04-Apr-2008 Mike Travis <travis@sgi.com>

nodemask: use new node_to_cpumask_ptr function

* Use new node_to_cpumask_ptr. This creates a pointer to the
cpumask for a given node. This definition is in mm patch:

asm-generic-add-node_to_cpumask_ptr-macro.patch

* Use new set_cpus_allowed_ptr function.

Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
[x86/latest]: x86: add cpus_scnprintf function

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 9442ec9d 04-Mar-2008 Hugh Dickins <hugh@veritas.com>

memcg: bad page if page_cgroup when free

Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
were set here, it'd likely cause corruption when the page is reused.

Don't use page_assign_page_cgroup to clear it: that should be private to
memcontrol.c, and always called with the lock taken; and memmap_init_zone
doesn't need it either - like page->mapping and other pointers throughout the
kernel, Linux assumes pointers in zeroed structures are NULL pointers.

Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 10ed273f 04-Mar-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

zlc_setup(): handle jiffies wraparound

jiffies subtraction may cause an overflow problem. It should be using
time_after().

[akpm@linux-foundation.org: include jiffies.h]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5a0e011 23-Feb-2008 Alexander van Heukelum <heukelum@mailshack.com>

Solve section mismatch for free_area_init_core.

WARNING: vmlinux.o(.meminit.text+0x649):
Section mismatch in reference from the
function free_area_init_core() to the function .init.text:setup_usemap()
The function __meminit free_area_init_core() references
a function __init setup_usemap().
If free_area_init_core is only used by setup_usemap then
annotate free_area_init_core with a matching annotation.

The warning is covers this stack of functions in mm/page_alloc.c:

alloc_bootmem_node must be marked __init.
alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
(usemap_size is only used by setup_usemap, if !SPARSEMEM.)
setup_usemap is only used by free_area_init_core.
free_area_init_core is only used by free_area_init_node.

free_area_init_node is used by:
arch/alpha/mm/numa.c: __init paging_init()
arch/arm/mm/init.c: __init bootmem_init_node()
arch/avr32/mm/init.c: __init paging_init()
arch/cris/arch-v10/mm/init.c: __init paging_init()
arch/cris/arch-v32/mm/init.c: __init paging_init()
arch/m32r/mm/discontig.c: __init zone_sizes_init()
arch/m32r/mm/init.c: __init zone_sizes_init()
arch/m68k/mm/motorola.c: __init paging_init()
arch/m68k/mm/sun3mmu.c: __init paging_init()
arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
arch/parisc/mm/init.c: __init paging_init()
arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
arch/sparc64/mm/init.c: __init paging_init()
mm/page_alloc.c: __init free_area_init_nodes()
mm/page_alloc.c: __init free_area_init()
and
mm/memory_hotplug.c: hotadd_new_pgdat()

hotadd_new_pgdat can not be an __init function, but:

It is compiled for MEMORY_HOTPLUG configurations only
MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
X86_64_ACPI_NUMA depends on X86_64
ARCH_FLATMEM_ENABLE depends on X86_32
ARCH_DISCONTIGMEM_ENABLE depends on X86_32
So X86_64_ACPI_NUMA implies SPARSEMEM, right?

So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
them __meminit for SPARSEMEM configurations. This is ok, because then the
calls to alloc_bootmem_node are also avoided.

Compile-tested on:
silly minimal config
defconfig x86_32
defconfig x86_64
defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# edde08f2 08-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com>

misc: removal of final callers using fastcall

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a9f3ccd 07-Feb-2008 Balbir Singh <balbir@linux.vnet.ibm.com>

Memory controller: memory accounting

Add the accounting hooks. The accounting is carried out for RSS and Page
Cache (unmapped) pages. There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time. Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache(). Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier. A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup. Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: <Valdis.Kletnieks@vt.edu>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e6f3602d 04-Feb-2008 Larry Woodman <lwoodman@redhat.com>

Include count of pagecache pages in show_mem() output

The show_mem() output does not include the total number of pagecache
pages. This would be helpful when analyzing the debug information in
the /var/log/messages file after OOM kills occur.

This patch includes the total pagecache pages in that output.

Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 920c7a5d 04-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com>

mm: remove fastcall from mm/

fastcall is always defined to be empty, remove it

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e548deb 04-Feb-2008 Andi Kleen <ak@linux.intel.com>

page allocator: remove unused arguments in zone_init_free_lists()

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3dfa5721 04-Feb-2008 Christoph Lameter <clameter@sgi.com>

Page allocator: get rid of the list of cold pages

We have repeatedly discussed if the cold pages still have a point. There is
one way to join the two lists: Use a single list and put the cold pages at the
end and the hot pages at the beginning. That way a single list can serve for
both types of allocations.

The discussion of the RFC for this and Mel's measurements indicate that
there may not be too much of a point left to having separate lists for
hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Martin Bligh <mbligh@mbligh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f8f2172 04-Feb-2008 Christoph Lameter <clameter@sgi.com>

Page allocator: clean up pcp draining functions

- Add comments explaing how drain_pages() works.

- Eliminate useless functions

- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.

- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.

- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c

- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d6f4e60 17-Jan-2008 Sam Ravnborg <sam@ravnborg.org>

mm: fix section mismatch warning in page_alloc.c

With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw
following warning:
WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback')

The culprit was zone_batchsize() which were annotated __devinit but used
from process_zones() which is annotated __cpuinit. zone_batchsize() are
used from another function annotated __meminit so the only valid option is
to drop the annotation of zone_batchsize() so we know it is always valid to
use it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 467bc461 08-Jan-2008 Thomas Bogendoerfer <tsbogend@alpha.franken.de>

Fix crash with FLAT_MEMORY and ARCH_PFN_OFFSET != 0

When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in
memmap_init_zone(). This bug got introduced by commit
c713216deebd95d2b0ab38fef8bb2361c0180c2d

Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81eabcbe 17-Dec-2007 Mel Gorman <mel@csn.ul.ie>

mm: fix page allocation for larger I/O segments

In some cases the IO subsystem is able to merge requests if the pages are
adjacent in physical memory. This was achieved in the allocator by having
expand() return pages in physically contiguous order in situations were a
large buddy was split. However, list-based anti-fragmentation changed the
order pages were returned in to avoid searching in buffered_rmqueue() for a
page of the appropriate migrate type.

This patch restores behaviour of rmqueue_bulk() preserving the physical
order of pages returned by the allocator without incurring increased search
costs for anti-fragmentation.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mark Lord <mlord@pobox.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba72cb8c 28-Nov-2007 Mel Gorman <mel@csn.ul.ie>

Fix boot problem with iSeries lacking hugepage support

Ordinarily the size of a pageblock is determined at compile-time based on the
hugepage size. On PPC64, the hugepage size is determined at runtime based on
what is supported by the machine. With legacy machines such as iSeries that
do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
being set to -PAGE_SHIFT and a crash results shortly afterwards.

This patch adds a function to select a sensible value for pageblock order by
default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
is used.

This is a fix for 2.6.24.

Credit goes to Stephen Rothwell for identifying the bug and testing candidate
patches. Additional credit goes to Andy Whitcroft for spotting a problem
with respects to IA-64 before releasing. Additional credit to David Gibson
for testing with the libhugetlbfs test suite.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09f345da 28-Nov-2007 Hugh Dickins <hugh@veritas.com>

prep_zero_page: remove bogus BUG_ON

2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait"
assertion. Presumably added under the misconception that clear_highpage
uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44048d70 12-Nov-2007 Linus Torvalds <torvalds@woody.linux-foundation.org>

Revert "Bias the placement of kernel pages at lower PFNs"

This reverts commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be.

Alexey Dobriyan reports that it causes huge slowdowns under some loads,
in his case a "mkfs.ext2" on a 30G partition. With the placement bias,
the mkfs took over four minutes, with it reverted it's back to about ten
seconds for Alexey.

Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 183ff22b 19-Oct-2007 Simon Arlott <simon@fire.lp0.eux>

spelling fixes: mm/

Spelling fixes in mm/.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Adrian Bunk <bunk@kernel.org>


# ff0ceb9d 17-Oct-2007 David Rientjes <rientjes@google.com>

oom: serialize out of memory calls

A final allocation attempt with a very high watermark needs to be attempted
before invoking out_of_memory(). OOM killer serialization needs to occur
before this final attempt, otherwise tasks attempting to OOM-lock all zones in
its zonelist may spin and acquire the lock unnecessarily after the OOM
condition has already been alleviated.

If the final allocation does succeed, the zonelist is simply OOM-unlocked and
__alloc_pages() returns the page. Otherwise, the OOM killer is invoked.

If the task cannot acquire OOM-locks on all zones in its zonelist, it is put
to sleep and the allocation is retried when it gets rescheduled. One of its
zones is already marked as being in the OOM killer so it'll hopefully be
getting some free memory soon, at least enough to satisfy a high watermark
allocation attempt. This prevents needlessly killing a task when the OOM
condition would have already been alleviated if it had simply been given
enough time.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e815af95 17-Oct-2007 David Rientjes <rientjes@google.com>

oom: change all_unreclaimable zone member to flags

Convert the int all_unreclaimable member of struct zone to unsigned long
flags. This can now be used to specify several different zone flags such as
all_unreclaimable and reclaim_in_progress, which can now be removed and
converted to a per-zone flag.

Flags are set and cleared as follows:

zone_set_flag(struct zone *zone, zone_flags_t flag)
zone_clear_flag(struct zone *zone, zone_flags_t flag)

Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
which have the same semantics as the old zone->all_unreclaimable and
zone->reclaim_in_progress, respectively. Also converts all current users that
set or clear either flag to use the new interface.

Helper functions are defined to test the flags:

int zone_is_all_unreclaimable(const struct zone *zone)
int zone_is_reclaim_locked(const struct zone *zone)

All flag operators are of the atomic variety because there are currently
readers that are implemented that do not take zone->lock.

[akpm@linux-foundation.org: add needed include]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a3135c2 17-Oct-2007 David Rientjes <rientjes@google.com>

oom: move prototypes to appropriate header file

Move the OOM killer's extern function prototypes to include/linux/oom.h and
include it where necessary.

[clg@fr.ibm.com: build fix]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0c0e6195 16-Oct-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory unplug: page offline

Logic.
- set all pages in [start,end) as isolated migration-type.
by this, all free pages in the range will be not-for-use.
- Migrate all LRU pages in the range.
- Test all pages in the range's refcnt is zero or not.

Todo:
- allocate migration destination page from better area.
- confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
(I don't like this kind of page but..
- Find out pages which cannot be migrated.
- more running tests.
- Use reclaim for unplugging other memory type area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5d76b54 16-Oct-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory unplug: page isolation

Implement generic chunk-of-pages isolation method by using page grouping ops.

This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
- MIGRATE_TYPES increases.
- bitmap for migratetype is enlarged.

pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
By this, you can isolated *freed* pages from users. How-to-free pages is not
a purpose of this patch. You may use reclaim and migrate codes to free pages.

If start_isolate_page_range(start,end) is called,
- migratetype of the range turns to be MIGRATE_ISOLATE if
its type is MIGRATE_MOVABLE. (*) this check can be updated if other
memory reclaiming works make progress.
- MIGRATE_ISOLATE is not on migratetype fallback list.
- All free pages and will-be-freed pages are isolated.
To check all pages in the range are isolated or not, use test_pages_isolated(),
To cancel isolation, use undo_isolate_page_range().

Changes V6 -> V7
- removed unnecessary #ifdef

There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48f13bf3 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Breakout page_order() to internal.h to avoid special knowledge of the buddy allocator

The statistics patch later needs to know what order a free page is on the free
lists. Rather than having special knowledge of page_private() when
PageBuddy() is set, this patch places out page_order() in internal.h and adds
a VM_BUG_ON to catch using it on non-PageBuddy pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 484f51f8 16-Oct-2007 Adrian Bunk <bunk@stusta.de>

mm/page_alloc.c: make code static

This patch makes needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 467c996c 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo

This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
The information is collected only on request so there is no runtime overhead.
The statistics are in three parts:

The first part prints information on the size of blocks that pages are
being grouped on and looks like

Page block order: 10
Pages per block: 1024

The second part is a more detailed version of /proc/buddyinfo and looks like

Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4

The third part looks like

Number of blocks type Unmovable Reclaimable Movable Reserve
Node 0, zone DMA 0 1 2 1
Node 0, zone Normal 3 17 94 4

To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
/proc/pagetypeinfo to reduce code duplication. It seems specific to what
vmstat.c requires but could be broken out as a general utility function in
mmzone.c if there were other other potential users.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9c23400 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Do not depend on MAX_ORDER when grouping pages by mobility

Currently mobility grouping works at the MAX_ORDER_NR_PAGES level. This makes
sense for the majority of users where this is also the huge page size.
However, on platforms like ia64 where the huge page size is runtime
configurable it is desirable to group at a lower order. On x86_64 and
occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.

This patch groups pages together based on the value of HUGETLB_PAGE_ORDER. It
uses a compile-time constant if possible and a variable where the huge page
size is runtime configurable.

It is assumed that grouping should be done at the lowest sensible order and
that the user would not want to override this. If this is not true,
page_block order could be forced to a variable initialised via a boot-time
kernel parameter.

One potential issue with this patch is that IA64 now parses hugepagesz with
early_param() instead of __setup(). __setup() is called after the memory
allocator has been initialised and the pageblock bitmaps already setup. In
tests on one IA64 there did not seem to be any problem with using
early_param() and in fact may be more correct as it guarantees the parameter
is handled before the parsing of hugepages=.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d100313f 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Fix calculation in move_freepages_block for counting pages

move_freepages_block() returns the number of blocks moved. This value is used
to determine if a block of pages should be stolen for the exclusive use of a
migrate type or not. However, the value returned is being used correctly.
This patch fixes the calculation to return the number of base pages that have
been moved.

This should be considered a fix to the patch
move-free-pages-between-lists-on-steal.patch

Credit to Andy Whitcroft for spotting the problem.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 64c5e135 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

don't group high order atomic allocations

Grouping high-order atomic allocations together was intended to allow
bursty users of atomic allocations to work such as e1000 in situations
where their preallocated buffers were depleted. This did not work in at
least one case with a wireless network adapter needing order-1 allocations
frequently. To resolve that, the free pages used for min_free_kbytes were
moved to separate contiguous blocks with the patch
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.

It is felt that keeping the free pages in the same contiguous blocks should
be sufficient for bursty short-lived high-order atomic allocations to
succeed, maybe even with the e1000. Even if there is a failure, increasing
the value of min_free_kbytes will free pages as contiguous bloks in
contrast to the standard buddy allocator which makes no attempt to keep the
minimum number of free pages contiguous.

This patch backs out grouping high order atomic allocations together to
determine if it is really needed or not. If a new report comes in about
high-order atomic allocations failing, the feature can be reintroduced to
determine if it fixes the problem or not. As a side-effect, this patch
reduces by 1 the number of bits required to track the mobility type of
pages within a MAX_ORDER_NR_PAGES block.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac0e5b7a 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

remove PAGE_GROUP_BY_MOBILITY

Grouping pages by mobility can be disabled at compile-time. This was
considered undesirable by a number of people. However, in the current stack of
patches, it is not a simple case of just dropping the configurable patch as it
would cause merge conflicts. This patch backs out the configuration option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56fd56b8 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks

The standard buddy allocator always favours the smallest block of pages.
The effect of this is that the pages free to satisfy min_free_kbytes tends
to be preserved since boot time at the same location of memory ffor a very
long time and as a contiguous block. When an administrator sets the
reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
that remain free. This allows the occasional high atomic allocation to
succeed up until the point the blocks are split. In practice, it is
difficult to split these blocks but when they do split, the benefit of
having min_free_kbytes for contiguous blocks disappears. Additionally,
increasing min_free_kbytes once the system has been running for some time
has no guarantee of creating contiguous blocks.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
blocks when there are no free pages of the appropriate type available. A
side-effect of this is that all blocks in memory tends to be used up and
the contiguous free blocks from boot time are not preserved like in the
vanilla allocator. This can cause a problem if a new caller is unwilling
to reclaim or does not reclaim for long enough.

A failure scenario was found for a wireless network device allocating
order-1 atomic allocations but the allocations were not intense or frequent
enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
This was reproduced on a desktop by booting with mem=256mb, forcing the
driver to allocate at order-1, running a bittorrent client (downloading a
debian ISO) and building a kernel with -j2.

This patch addresses the problem on the desktop machine booted with
mem=256mb. It works by setting aside a reserve of MAX_ORDER_NR_PAGES
blocks, the number of which depends on the value of min_free_kbytes. These
blocks are only fallen back to when there is no other free pages. Then the
smallest possible page is used just like the normal buddy allocator instead
of the largest possible page to preserve contiguous pages The pages in free
lists in the reserve blocks are never taken for another migrate type. The
results is that even if min_free_kbytes is set to a low value, contiguous
blocks will be preserved in the MIGRATE_RESERVE blocks.

This works better than the vanilla allocator because if min_free_kbytes is
increased, a new reserve block will be chosen based on the location of
reclaimable pages and the block will free up as contiguous pages. In the
vanilla allocator, no effort is made to target a block of pages to free as
contiguous pages and min_free_kbytes pages are scattered randomly.

This effect has been observed on the test machine. min_free_kbytes was set
initially low but it was kept as a contiguous free block within
MIGRATE_RESERVE. min_free_kbytes was then set to a higher value and over a
period of time, the free blocks were within the reserve and coalescing.
How long it takes to free up depends on how quickly LRU is rotating.
Amusingly, this means that more activity will free the blocks faster.

This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
effective than grouping contiguous free pages together. It all depends on
whether the number of active atomic high allocations exceeds
min_free_kbytes or not. If the number of active allocations exceeds
min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
should be set higher. Once there are no more reports of allocation
failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC and
see if the reports stay missing.

Credit to Mariusz Kozlowski for discovering the problem, describing the
failure scenario and testing patches and scenarios.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46dafbca 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Be more agressive about stealing when MIGRATE_RECLAIMABLE allocations fallback

MIGRATE_RECLAIMABLE allocations tend to be very bursty in nature like when
updatedb starts. It is likely this will occur in situations where MAX_ORDER
blocks of pages are not free. This means that updatedb can scatter
MIGRATE_RECLAIMABLE pages throughout the address space. This patch is more
agressive about stealing blocks of pages for MIGRATE_RECLAIMABLE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5adc5be7 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Bias the placement of kernel pages at lower PFNs

This patch chooses blocks with lower PFNs when placing kernel allocations.
This is particularly important during fallback in low memory situations to
stop unmovable pages being placed throughout the entire address space.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ef9acb0 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Do not group pages by mobility type on low memory systems

Grouping pages by mobility can only successfully operate when there are more
MAX_ORDER_NR_PAGES areas than mobility types. When there are insufficient
areas, fallbacks cannot be avoided. This has noticeable performance impacts
on machines with small amounts of memory in comparison to MAX_ORDER_NR_PAGES.
For example, on IA64 with a configuration including huge pages spans 1GiB with
MAX_ORDER_NR_PAGES so would need at least 4GiB of RAM before grouping pages by
mobility would be useful. In comparison, an x86 would need 16MB.

This patch checks the size of vm_total_pages in build_all_zonelists(). If
there are not enough areas, mobility is effectivly disabled by considering
all allocations as the same type (UNMOVABLE). This is achived via a
__read_mostly flag.

With this patch, performance is comparable to disabling grouping pages
by mobility at compile-time on a test machine with insufficient memory.
With this patch, it is reasonable to get rid of grouping pages by mobility
a compile-time option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e010487d 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Group high-order atomic allocations

In rare cases, the kernel needs to allocate a high-order block of pages
without sleeping. For example, this is the case with e1000 cards configured
to use jumbo frames. Migrating or reclaiming pages in this situation is not
an option.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_HIGHATOMIC type are exactly what they sound
like. Care is taken that pages of other migrate types do not use the same
blocks as high-order atomic allocations.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e12ba74d 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Group short-lived and reclaimable kernel allocations

This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c361be55 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Move free pages between lists on steal

When a fallback occurs, there will be free pages for one allocation type
stored on the list for another. When a large steal occurs, this patch will
move all the free pages within one list to the other.

[y-goto@jp.fujitsu.com: fix BUG_ON check at move_freepages()]
[apw@shadowen.org: Move to using pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andy Whitcroft <andyw@uk.ibm.com>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2c55dc8 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Drain per-cpu lists when high-order allocations fail

Per-cpu pages can accidentally cause fragmentation because they are free, but
pinned pages in an otherwise contiguous block. When this patch is applied,
the per-cpu caches are drained after the direct-reclaim is entered if the
requested order is greater than 0. It simply reuses the code used by suspend
and hotplug.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b92a6edd 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Add a configure option to group pages by mobility

The grouping mechanism has some memory overhead and a more complex allocation
path. This patch allows the strategy to be disabled for small memory systems
or if it is known the workload is suffering because of the strategy. It also
acts to show where the page groupings strategy interacts with the standard
buddy allocator.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 535131e6 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Choose pages from the per-cpu list based on migration type

The freelists for each migrate type can slowly become polluted due to the
per-cpu list. Consider what happens when the following happens

1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page is in the middle of a migratable region. This
patch prevents this leak occuring by storing the MIGRATE_ type of the page in
page->private. On allocate, a page will only be returned of the desired type,
else more pages will be allocated. This may temporarily allow a per-cpu list
to go over the pcp->high limit but it'll be corrected on the next free. Care
is taken to preserve the hotness of pages recently freed.

The additional code is not measurably slower for the workloads we've tested.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2a0ac88 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Split the free lists for movable and unmovable allocations

This patch adds the core of the fragmentation reduction strategy. It works by
grouping pages together based on their ability to migrate or be reclaimed.
Basically, it works by breaking the list in zone->free_area list into
MIGRATE_TYPES number of lists.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 835c134e 16-Oct-2007 Mel Gorman <mel@csn.ul.ie>

Add a bitmap that is used to track flags affecting a block of pages

Here is the latest revision of the anti-fragmentation patches. Of particular
note in this version is special treatment of high-order atomic allocations.
Care is taken to group them together and avoid grouping pages of other types
near them. Artifical tests imply that it works. I'm trying to get the
hardware together that would allow setting up of a "real" test. If anyone
already has a setup and test that can trigger the atomic-allocation problem,
I'd appreciate a test of these patches and a report. The second major change
is that these patches will apply cleanly with patches that implement
anti-fragmentation through zones.

kernbench shows effectively no performance difference varying between -0.2%
and +2% on a variety of test machines. Success rates for huge page allocation
are dramatically increased. For example, on a ppc64 machine, the vanilla
kernel was only able to allocate 1% of memory as a hugepage and this was due
to a single hugepage reserved as min_free_kbytes. With these patches applied,
17% was allocatable as superpages. With reclaim-related fixes from Andy
Whitcroft, it was 40% and further reclaim-related improvements should increase
this further.

Changelog Since V28
o Group high-order atomic allocations together
o It is no longer required to set min_free_kbytes to 10% of memory. A value
of 16384 in most cases will be sufficient
o Now applied with zone-based anti-fragmentation
o Fix incorrect VM_BUG_ON within buffered_rmqueue()
o Reorder the stack so later patches do not back out work from earlier patches
o Fix bug were journal pages were being treated as movable
o Bias placement of non-movable pages to lower PFNs
o More agressive clustering of reclaimable pages in reactions to workloads
like updatedb that flood the size of inode caches

Changelog Since V27

o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
the mistaken impression that it was the 100% solution for high order
allocations. Instead, it greatly increases the chances high-order
allocations will succeed and lays the foundation for defragmentation and
memory hot-remove to work properly
o Redefine page groupings based on ability to migrate or reclaim instead of
basing on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
searched for a page of the appropriate type
o Added more explanation commentary
o Fix up bug in pageblock code where bitmap was used before being initalised

Changelog Since V26
o Fix double init of lists in setup_pageset

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together. When pages are migrated (or reclaimed under
memory pressure), large contiguous pages will be freed.

This patch works by categorising allocations by their ability to migrate;

Movable - The pages may be moved with the page migration mechanism. These are
generally userspace pages.

Reclaimable - These are allocations for some kernel caches that are
reclaimable or allocations that are known to be very short-lived.

Unmovable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type

HighAtomic - These are high-order allocations belonging to callers that
cannot sleep or perform any IO. In practice, this is restricted to
jumbo frame allocation for network receive. It is assumed that the
allocations are short-lived

Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
there is one for each type of reclaimability. Once a 2^MAX_ORDER block of
pages is split for a type of allocation, it is added to the free-lists for
that type, in effect reserving it. Hence, over time, pages of the different
types can be clustered together.

When the preferred freelists are expired, the largest possible block is taken
from an alternative list. Buddies that are split from that large block are
placed on the preferred allocation-type freelists to mitigate fragmentation.

This implementation gives best-effort for low fragmentation in all zones.
Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
(MAX_ORDER-1)) pages in most cases. This would be 16384 on x86 and x86_64 for
example.

Our tests show that about 60-70% of physical memory can be allocated on a
desktop after a few days uptime. In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of the
test. To compare, a standard kernel was getting < 1% of memory as large pages
on a desktop and about 8-12% of memory as large pages at the end of stress
tests.

Following this email are 12 patches that implement thie page grouping feature.
The first patch introduces a mechanism for storing flags related to a whole
block of pages. Then allocations are split between movable and all other
allocations. Following that are patches to deal with per-cpu pages and make
the mechanism configurable. The next patch moves free pages between lists
when partially allocated blocks are used for pages of another migrate type.
The second last patch groups reclaimable kernel allocations such as inode
caches together. The final patch related to groupings keeps high-order atomic
allocations.

The last two patches are more concerned with control of fragmentation. The
second last patch biases placement of non-movable allocations towards the
start of memory. This is with a view of supporting memory hot-remove of DIMMs
with higher PFNs in the future. The biasing could be enforced a lot heavier
but it would cost. The last patch agressively clusters reclaimable pages like
inode caches together.

The fragmentation reduction strategy needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole a MAX_ORDER block of
pages.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
allocated during initialisation. SPARSEMEM statically allocates the bitmap in
a struct mem_section so that bitmaps do not have to be resized during memory
hotadd. This wastes a small amount of memory per unused section (usually
sizeof(unsigned long)) but the complexity of dynamically allocating the memory
is quite high.

Additional credit to Andy Whitcroft who reviewed up an earlier implementation
of the mechanism an suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37b07e41 16-Oct-2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>

memoryless nodes: fixup uses of node_online_map in generic code

Here's a cut at fixing up uses of the online node map in generic code.

mm/shmem.c:shmem_parse_mpol()

Ensure nodelist is subset of nodes with memory.
Use node_states[N_HIGH_MEMORY] as default for missing
nodelist for interleave policy.

mm/shmem.c:shmem_fill_super()

initialize policy_nodes to node_states[N_HIGH_MEMORY]

mm/page-writeback.c:highmem_dirtyable_memory()

sum over nodes with memory

mm/page_alloc.c:zlc_setup()

allowednodes - use nodes with memory.

mm/page_alloc.c:default_zonelist_order()

average over nodes with memory.

mm/page_alloc.c:find_next_best_node()

skip nodes w/o memory.
N_HIGH_MEMORY state mask may not be initialized at this time,
unless we want to depend on early_calculate_totalpages() [see
below]. Will ZONE_MOVABLE ever be configurable?

mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

spread kernelcore over nodes with memory.

This required calling early_calculate_totalpages()
unconditionally, and populating N_HIGH_MEMORY node
state therein from nodes in the early_node_map[].
If we can depend on this, we can eliminate the
population of N_HIGH_MEMORY mask from __build_all_zonelists()
and use the N_HIGH_MEMORY mask in find_next_best_node().

mm/mempolicy.c:mpol_check_policy()

Ensure nodes specified for policy are subset of
nodes with memory.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 523b9458 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: Fix GFP_THISNODE behavior

GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a nodelist. That only works if the node has memory. A
memoryless node will have its first node on another pgdat (node).

GFP_THISNODE currently will return simply memory on the first pgdat. Thus it
is returning memory on other nodes. GFP_THISNODE should fail if there is no
local memory on a node.

Add a new set of zonelists for each node that only contain the nodes that
belong to the zones itself so that no fallback is possible.

Then modify gfp_type to pickup the right zone based on the presence of
__GFP_THISNODE.

Drop the existing GFP_THISNODE checks from the page_allocators hot path.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 633c0666 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: drop one memoryless node boot warning

get_pfn_range_for_nid() is called multiple times for each node at boot time.
Each time, it will warn about nodes with no memory, resulting in boot messages
like:

Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
On node 0 totalpages: 0
Node 0 active with no memory
Node 0 active with no memory
DMA zone: 0 pages used for memmap
Node 0 active with no memory
Node 0 active with no memory
Normal zone: 0 pages used for memmap
Node 0 active with no memory
Node 0 active with no memory
Movable zone: 0 pages used for memmap

and so on for each memoryless node.

We already have the "On node N totalpages: ..." and other related messages, so
drop the "Node N active with no memory" warnings.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37c0708d 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: Add N_CPU node state

We need the check for a node with cpu in zone reclaim. Zone reclaim will not
allow remote zone reclaim if a node has a cpu.

[Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ea1530a 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: introduce mask of nodes with memory

It is necessary to know if nodes have memory since we have recently begun to
add support for memoryless nodes. For that purpose we introduce a two new
node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
type of mmemory. If a node has memory then it has at least one zone defined
in its pgdat structure that is located in the pgdat itself.

A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
ZONE_HIGHMEM. This means it is possible to allocate memory that is not
subject to kmap.

N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to insure
that we do the right thing when we encounter a memoryless node.

[akpm@linux-foundation.org: build fix]
[Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
[y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 13808910 16-Oct-2007 Christoph Lameter <clameter@sgi.com>

Memoryless nodes: Generic management of nodemasks for various purposes

Why do we need to support memoryless nodes?

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> For fujitsu, problem is called "empty" node.
>
> When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
> creates nodes, which includes no memory, no cpu.
>
> I tried to remove empty-node in past, but that was denied.
> It was because we can hot-add cpu to the empty node.
> (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
>
>
> For HP, (Lee can comment on this later), they have memory-less-node.
> As far as I hear, HP's machine can have following configration.
>
> (example)
> Node0: CPU0 memory AAA MB
> Node1: CPU1 memory AAA MB
> Node2: CPU2 memory AAA MB
> Node3: CPU3 memory AAA MB
> Node4: Memory XXX GB
>
> AAA is very small value (below 16MB) and will be omitted by ia64 bootstrap.
> After boot, only Node 4 has valid memory (but have no cpu.)
>
> Maybe this is memory-interleave by firmware config.

Christoph Lameter <clameter@sgi.com> wrote:

> Future SGI platforms (actually also current one can have but nothing like
> that is deployed to my knowledge) have nodes with only cpus. Current SGI
> platforms have nodes with just I/O that we so far cannot manage in the
> core. So the arch code maps them to the nearest memory node.

Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> For the HP platforms, we can configure each cell with from 0% to 100%
> "cell local memory". When we configure with <100% CLM, the "missing
> percentages" are interleaved by hardware on a cache-line granularity to
> improve bandwidth at the expense of latency for numa-challenged
> applications [and OSes, but not our problem ;-)]. When we boot Linux on
> such a config, all of the real nodes have no memory--it all resides in a
> single interleaved pseudo-node.
>
> When we boot Linux on a 100% CLM configuration [== NUMA], we still have
> the interleaved pseudo-node. It contains a few hundred MB stolen from
> the real nodes to contain the DMA zone. [Interleaved memory resides at
> phys addr 0]. The memoryless-nodes patches, along with the zoneorder
> patches, support this config as well.
>
> Also, when we boot a NUMA config with the "mem=" command line,
> specifying less memory than actually exists, Linux takes the excluded
> memory "off the top" rather than distributing it across the nodes. This
> can result in memoryless nodes, as well.
>

This patch:

Preparation for memoryless node patches.

Provide a generic way to keep nodemasks describing various characteristics of
NUMA nodes.

Remove the node_online_map and the node_possible map and realize the same
functionality using two nodes stats: N_POSSIBLE and N_ONLINE.

[Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8691f3a7 16-Oct-2007 Jesper Juhl <jesper.juhl@gmail.com>

mm: no need to cast vmalloc() return value in zone_wait_table_init()

vmalloc() returns a void pointer, so there's no need to cast its
return value in mm/page_alloc.c::zone_wait_table_init().

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 64191688 31-Aug-2007 Andrew Morton <akpm@linux-foundation.org>

process_zones(): fix recovery code

Don't try to free memory which we didn't allocate.

Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b377fd39 22-Aug-2007 Mel Gorman <mel@csn.ul.ie>

Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

The NUMA layer only supports NUMA policies for the highest zone. When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE. The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.

The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64. No abnormal results were seen in
kernbench, tbench, dbench or hackbench. It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.

akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on. That solution
would only use a single zonelist per node and filter on the fly. That may
help performance and also help to make memory policies work better."

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8bbf72a 31-Jul-2007 Mel Gorman <mel@csn.ul.ie>

Do not trigger OOM-killer for high-order allocation failures

out_of_memory() may be called when an allocation is failing and the direct
reclaim is not making any progress. This does not take into account the
requested order of the allocation. If the request if for an order larger
than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation
because the kernel makes no guarantees about those allocations succeeding.

This false OOM situation can occur if a user is trying to grow the hugepage
pool in a script like;

#!/bin/bash
REQUIRED=$1
echo 1 > /proc/sys/vm/hugepages_treat_as_movable
echo $REQUIRED > /proc/sys/vm/nr_hugepages
ACTUAL=`cat /proc/sys/vm/nr_hugepages`
while [ $REQUIRED -ne $ACTUAL ]; do
echo Huge page pool at $ACTUAL growing to $REQUIRED
echo $REQUIRED > /proc/sys/vm/nr_hugepages
ACTUAL=`cat /proc/sys/vm/nr_hugepages`
sleep 1
done

This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM
easily on 2.6.23-rc1. This patch will fail an allocation for an order above
PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 296699de 29-Jul-2007 Rafael J. Wysocki <rjw@rjwysocki.net>

Introduce CONFIG_SUSPEND for suspend-to-Ram and standby

Introduce CONFIG_SUSPEND representing the ability to enter system sleep
states, such as the ACPI S3 state, and allow the user to choose SUSPEND
and HIBERNATION independently of each other.

Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
been chosen and the kernel is intended for SMP systems.

Also, introduce CONFIG_PM_SLEEP which is automatically selected if
CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
code needed for both suspend and hibernation.

The top-level power management headers and the ACPI code related to
suspend and hibernation are modified to use the new definitions (the
changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
the number of ifdefs).

There are many other files in which CONFIG_PM can be replaced with
CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
the future.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5445f95 26-Jul-2007 Mel Gorman <mel@csn.ul.ie>

Allow nodes to exist that only contain ZONE_MOVABLE

With the introduction of kernelcore=, a configurable zone is created on
request. In some cases, this value will be small enough that some nodes
contain only ZONE_MOVABLE. On some NUMA configurations when this occurs,
arch-independent zone-sizing will get the size of the memory holes within
the node incorrect. The value of present_pages goes negative and the boot
fails.

This patch fixes the bug in the calculation of the size of the hole. The
test case is to boot test a NUMA machine with a low value of kernelcore=
before and after the patch is applied. While this bug exists in early
kernel it cannot be triggered in practice.

This patch has been boot-tested on a variety machines with and without
kernelcore= set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e228929b 20-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: fix memory hotplug oops from ZONE_MOVABLE changes.

zone_movable_pfn is presently marked as __initdata and referenced from
adjust_zone_range_for_zone_movable(), which in turn is referenced by
zone_spanned_pages_in_node(). Both of these are __meminit annotated. When
memory hotplug is enabled, this will oops on a hot-add, due to
zone_movable_pfn having been freed.

__meminitdata annotation gives the desired behaviour.

This will only impact platforms that enable both memory hotplug
and ARCH_POPULATES_NODE_MAP.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe3cba17 19-Jul-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

mm: share PG_readahead and PG_reclaim

Share the same page flag bit for PG_readahead and PG_reclaim.

One is used only on file reads, another is only for emergency writes. One
is used mostly for fresh/young pages, another is for old pages.

Combinations of possible interactions are:

a) clear PG_reclaim => implicit clear of PG_readahead
it will delay an asynchronous readahead into a synchronous one
it actually does _good_ for readahead:
the pages will be reclaimed soon, it's readahead thrashing!
in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
one(and only one) page will not be reclaimed in time
it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
will confuse readahead and make it restart the size rampup process
it's a trivial problem, and can mostly be avoided by checking
PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
PG_readahead will never be set on already cached pages.
PG_reclaim will always be cleared on dirtying a page.
so not a problem.

In summary,
a) we get better behavior
b,d) possible interactions can be avoided
c) racy condition exists that might affect readahead, but the chance
is _really_ low, and the hurt on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d77c2d7c 19-Jul-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

readahead: introduce PG_readahead

Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until
it runs out of cached pages!

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2f1a551 17-Jul-2007 Meelap Shah <meelap@umich.edu>

knfsd: nfsd4: vary maximum delegation limit based on RAM size

Our original NFSv4 delegation policy was to give out a read delegation on any
open when it was possible to.

Since the lifetime of a delegation isn't limited to that of an open, a client
may quite reasonably hang on to a delegation as long as it has the inode
cached. This becomes an obvious problem the first time a client's inode cache
approaches the size of the server's total memory.

Our first quick solution was to add a hard-coded limit. This patch makes a
mild incremental improvement by varying that limit according to the server's
total memory size, allowing at most 4 delegations per megabyte of RAM.

My quick back-of-the-envelope calculation finds that in the worst case (where
every delegation is for a different inode), a delegation could take about
1.5K, which would make the worst case usage about 6% of memory. The new limit
works out to be about the same as the old on a 1-gig server.

[akpm@linux-foundation.org: Don't needlessly bloat vmlinux]
[akpm@linux-foundation.org: Make it right for highmem machines]
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ad333eb 17-Jul-2007 Andy Whitcroft <apw@shadowen.org>

Lumpy Reclaim V4

When we are out of memory of a suitable size we enter reclaim. The current
reclaim algorithm targets pages in LRU order, which is great for fairness at
order-0 but highly unsuitable if you desire pages at higher orders. To get
pages of higher order we must shoot down a very high proportion of memory;
>95% in a lot of cases.

This patch set adds a lumpy reclaim algorithm to the allocator. It targets
groups of pages at the specified order anchored at the end of the active and
inactive lists. This encourages groups of pages at the requested orders to
move from active to inactive, and active to free lists. This behaviour is
only triggered out of direct reclaim when higher order pages have been
requested.

This patch set is particularly effective when utilised with an
anti-fragmentation scheme which groups pages of similar reclaimability
together.

This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
the foundation. Credit to Mel Gorman for sanitity checking.

Mel said:

The patches have an application with hugepage pool resizing.

When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can
be resized with greater reliability. Testing on a desktop machine with 2GB
of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own
was very slow as the success rate was quite low. Without lumpy-reclaim,
each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

[akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
[bunk@stusta.de: static declarations for internal functions]
[a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e63efef 17-Jul-2007 Mel Gorman <mel@csn.ul.ie>

Add a movablecore= parameter for sizing ZONE_MOVABLE

This patch adds a new parameter for sizing ZONE_MOVABLE called
movablecore=. While kernelcore= is used to specify the minimum amount of
memory that must be available for all allocation types, movablecore= is
used to specify the minimum amount of memory that is used for migratable
allocations. The amount of memory used for migratable allocations
determines how large the huge page pool could be dynamically resized to at
runtime for example.

How movablecore is actually handled is that the total number of pages in
the system is calculated and a value is set for kernelcore that is

kernelcore == totalpages - movablecore

Both kernelcore= and movablecore= can be safely specified at the same time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed7ed365 17-Jul-2007 Mel Gorman <mel@csn.ul.ie>

handle kernelcore=: generic

This patch adds the kernelcore= parameter for x86.

Once all patches are applied, a new command-line parameter exist and a new
sysctl. This patch adds the necessary documentation.

From: Yasunori Goto <y-goto@jp.fujitsu.com>

When "kernelcore" boot option is specified, kernel can't boot up on ia64
because of an infinite loop. In addition, the parsing code can be handled
in an architecture-independent manner.

This patch uses common code to handle the kernelcore= parameter. It is
only available to architectures that support arch-independent zone-sizing
(i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
ignore the boot parameter.

[bunk@stusta.de: make cmdline_parse_kernelcore() static]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a1e274a 17-Jul-2007 Mel Gorman <mel@csn.ul.ie>

Create the ZONE_MOVABLE zone

The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
that is only usable by allocations that specify both __GFP_HIGHMEM and
__GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
single memory partition while allowing movable allocations to be satisfied
from either partition. The patches may be applied with the list-based
anti-fragmentation patches that groups pages together based on mobility.

The size of the zone is determined by a kernelcore= parameter specified at
boot-time. This specifies how much memory is usable by non-movable
allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

When selecting a zone to take pages from for ZONE_MOVABLE, there are two
things to consider. First, only memory from the highest populated zone is
used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
the amount of memory usable by the kernel will be spread evenly throughout
NUMA nodes where possible. If the nodes are not of equal size, the amount of
memory usable by the kernel on some nodes may be greater than others.

By default, the zone is not as useful for hugetlb allocations because they are
pinned and non-migratable (currently at least). A sysctl is provided that
allows huge pages to be allocated from that zone. This means that the huge
page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
the system assuming that pages are not mlocked. Despite huge pages being
non-movable, we do not introduce additional external fragmentation of note as
huge pages are always the largest contiguous block we care about.

Credit goes to Andy Whitcroft for catching a large variety of problems during
review of the patches.

This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
memory continues to be placed in their existing destination as there is no
mechanism to redirect them to a specific zone.

[y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54114994 16-Jul-2007 Akinobu Mita <akinobu.mita@gmail.com>

fault-injection: add min-order parameter to fail_page_alloc

Limiting smaller allocation failures by fault injection helps to find real
possible bugs. Because higher order allocations are likely to fail and
zero-order allocations are not likely to fail.

This patch adds min-order parameter to fail_page_alloc. It specifies the
minimum page allocation order to be injected failures.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b49ad484 16-Jul-2007 Dan Aloni <da-x@monatomic.org>

mm/page_alloc.c: lower printk severity

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6ea6e688 16-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: more __meminit annotations

Currently zone_spanned_pages_in_node() and zone_absent_pages_in_node() are
non-static for ARCH_POPULATES_NODE_MAP and static otherwise. However, only
the non-static versions are __meminit annotated, despite only being called
from __meminit functions in either case.

zone_init_free_lists() is currently non-static and not __meminit annotated
either, despite only being called once in the entire tree by
init_currently_empty_zone(), which too is __meminit. So make it static and
properly annotated.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98011f56 16-Jul-2007 Jan Beulich <jbeulich@novell.com>

mm: fix improper .init-type section references

.. which modpost started warning about.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1037b83b 16-Jul-2007 Eric Dumazet <dada1@cosmosbay.com>

MM: alloc_large_system_hash() can free some memory for non power-of-two bucketsize

alloc_large_system_hash() is called at boot time to allocate space for
several large hash tables.

Lately, TCP hash table was changed and its bucketsize is not a power-of-two
anymore.

On most setups, alloc_large_system_hash() allocates one big page (order >
0) with __get_free_pages(GFP_ATOMIC, order). This single high_order page
has a power-of-two size, bigger than the needed size.

We can free all pages that wont be used by the hash table.

On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory.

TCP established hash table entries: 32768 (order: 6, 393216 bytes)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0c0b2b8 16-Jul-2007 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

change zonelist order: zonelist order selection logic

Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
* A user can expect local memory as much as possible.
Cons.
* lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
* OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
* memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d09c6b80 14-Jun-2007 Paul Mundt <lethal@linux-sh.org>

mm: Fix memory/cpu hotplug section mismatch and oops.

When building with memory hotplug enabled and cpu hotplug disabled, we
end up with the following section mismatch:

WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
.init.text: (between 'free_area_init_node' and '__build_all_zonelists')

This happens as a result of:

-> free_area_init_node()
-> free_area_init_core()
-> zone_pcp_init() <-- all __meminit up to this point
-> zone_batchsize() <-- marked as __cpuinit fo

This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
CONFIG_MEMORY_HOTPLUG=y unsets __meminit.

Changing zone_batchsize() to __devinit fixes this.

__devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
another section identifier completely. Without this, memory hot-add
of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
not also enabled.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

--

mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


# 12d810c1 31-May-2007 Roman Zippel <zippel@linux-m68k.org>

m68k: discontinuous memory support

Fix support for discontinuous memory

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 418508c1 23-May-2007 Miklos Szeredi <mszeredi@suse.cz>

fix unused setup_nr_node_ids

mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used

This is now the only (!) compiler warning I get in my UML build :)

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 577a32f6 17-May-2007 Sam Ravnborg <sam@ravnborg.org>

mm: fix section mismatch warnings

modpost had two cases hardcoded for mm/
Shift over to __init_refok and kill the
hardcoded function names in modpost.

This has the drawback that the functions
will always be kept no matter configuration.
With previous code the function were placed in
init section if configuration allowed it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>


# 6f076f5d 10-May-2007 Stephen Rothwell <sfr@canb.auug.org.au>

early_pfn_to_nid needs to be __meminit

Since it is referenced by memmap_init_zone (which is __meminit) via the
early_pfn_in_nid macro when CONFIG_NODES_SPAN_OTHER_NODES is set (which
basically means PowerPC 64).

This removes a section mismatch warning in those circumstances.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4037d452 09-May-2007 Christoph Lameter <clameter@sgi.com>

Move remote node draining out of slab allocators

Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes. This requires SLUB to have
a whole subsystem in order to be compatible with SLAB. Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there. If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero. Then we will drain one batch from the pageset. The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty. So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues. The number of pcp is determined by the
number of processors and nodes in a system. A system with 4 processors and 2
nodes has 8 pcps which is okay. But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8bb78442 09-May-2007 Rafael J. Wysocki <rjw@rjwysocki.net>

Add suspend-related notifications for CPU hotplug

Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72280ede 08-May-2007 Yasunori Goto <y-goto@jp.fujitsu.com>

Add white list into modpost.c for memory hotplug code and ia64's machvec section

This patch is add white list into modpost.c for some functions and
ia64's section to fix section mismatchs.

sparse_index_alloc() and zone_wait_table_init() calls bootmem allocator
at boot time, and kmalloc/vmalloc at hotplug time. If config
memory hotplug is on, there are references of bootmem allocater(init text)
from them (normal text). This is cause of section mismatch.

Bootmem is called by many functions and it must be
used only at boot time. I think __init of them should keep for
section mismatch check. So, I would like to register sparse_index_alloc()
and zone_wait_table_init() into white list.

In addition, ia64's .machvec section is function table of some platform
dependent code. It is mixture of .init.text and normal text. These
reference of __init functions are valid too.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3142c8e 08-May-2007 Yasunori Goto <y-goto@jp.fujitsu.com>

Fix section mismatch of memory hotplug related code.

This is to fix many section mismatches of code related to memory hotplug.
I checked compile with memory hotplug on/off on ia64 and x86-64 box.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7be98234 06-May-2007 Rafael J. Wysocki <rjw@rjwysocki.net>

swsusp: use inline functions for changing page flags

Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc. with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d777953 06-May-2007 Christoph Lameter <clameter@sgi.com>

mm: optimize compound_head() by avoiding a shared page flag

The patch adds PageTail(page) and PageHead(page) to check if a page is the
head or the tail of a compound page. This is done by masking the two bits
describing the state of a compound page and then comparing them. So one
comparision and a branch instead of two bit checks and two branches.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d85f3385 06-May-2007 Christoph Lameter <clameter@sgi.com>

Make page->private usable in compound pages

If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages. Right now we have to be careful f.e.
if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field. This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct. I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads. This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim. Compound pages cannot currently
be reclaimed. Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b1d92c5 06-May-2007 Mel Gorman <mel@csn.ul.ie>

Do not disable interrupts when reading min_free_kbytes

The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
read or write. This function iterates through every zone and calls
spin_lock_irqsave() on the zone LRU lock. When reading min_free_kbytes,
this is a total waste of time that disables interrupts on the local
processor. It might even be noticable machines with large numbers of zones
if a process started constantly reading min_free_kbytes.

This patch only calls setup_per_zone_pages_min() only on write. Tested on
an x86 laptop and it did the right thing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 14e07298 06-May-2007 Andy Whitcroft <apw@shadowen.org>

add pfn_valid_within helper for sub-MAX_ORDER hole detection

Generally we work under the assumption that memory the mem_map array is
contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie. that if we
have validated any page within this MAX_ORDER_NR_PAGES block we need not check
any other. This is not true when CONFIG_HOLES_IN_ZONE is set and we must
check each and every reference we make from a pfn.

Add a pfn_valid_within() helper which should be used when scanning pages
within a MAX_ORDER_NR_PAGES block when we have already checked the validility
of the block normally with pfn_valid(). This can then be optimised away when
we do not have holes within a MAX_ORDER_NR_PAGES block of pages.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 476f3534 06-May-2007 Christoph Lameter <clameter@engr.sgi.com>

Safer nr_node_ids and nr_node_ids determination and initial values

The nr_cpu_ids value is currently only calculated in smp_init. However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init. So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.

Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take. If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated. If it is set to the maximum possible then we only waste
some memory for early boot users.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5409bae0 28-Feb-2007 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] Rename PG_checked to PG_owner_priv_1

Rename PG_checked to PG_owner_priv_1 to reflect its availablilty as a
private flag for use by the owner/allocator of the page. In the case of
pagecache pages (which might be considered to be owned by the mm),
filesystems may use the flag.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8ef82866 20-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: reduce size of alien cache to cover only possible nodes

The alien cache is a per cpu per node array allocated for every slab on the
system. Currently we size this array for all nodes that the kernel does
support. For IA64 this is 1024 nodes. So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.

This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.

The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74c7aa8b 20-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Replace highest_possible_node_id() with nr_node_ids

highest_possible_node_id() is currently used to calculate the last possible
node idso that the network subsystem can figure out how to size per node
arrays.

I think having the ability to determine the maximum amount of nodes in a
system at runtime is useful but then we should name this entry
correspondingly, it should return the number of node_ids, and the the value
needs to be setup only once on bootup. The node_possible_map does not
change after bootup.

This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id(). nr_node_ids is calculated on bootup when the
page allocators pagesets are initialized.

[deweerdt@free.fr: fix oops]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b51d669 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM

Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6267276f 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone

This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones. The patches allow to go to a minimum of 2 zones. This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.

ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons
why we would not want to have ZONE_DMA

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
The necessity of DMA zones may drastically be reduced
in the future. This patchset allows a compilation of
a kernel without that overhead.

3. Devices that require ISA DMA get rare these days. All
my systems do not have any need for ISA DMA.

4. The presence of an additional zone unecessarily complicates
VM operations because it must be scanned and balancing
logic must operate on its.

5. With only ZONE_NORMAL one can reach the situation where
we have only one zone. This will allow the unrolling of many
loops in the VM and allows the optimization of varous
code paths in the VM.

6. Having only a single zone in a NUMA system results in a
1-1 correspondence between nodes and zones. Various additional
optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone. If you look at
what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo? Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them. An effort has been made to minize the number of #ifdefs
and make this as compact as possible. The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone Normal
pages free 4435
min 1448
low 1810
high 2172
active 241786
inactive 210170
scanned 0 (a: 0 i: 0)
spanned 524224
present 524224
nr_anon_pages 61680
nr_mapped 14271
nr_file_pages 390264
nr_slab_reclaimable 27564
nr_slab_unreclaimable 1793
nr_page_table_pages 449
nr_dirty 39
nr_writeback 0
nr_unstable 0
nr_bounce 0
cpu: 0 pcp: 0
count: 156
high: 186
batch: 31
cpu: 0 pcp: 1
count: 9
high: 62
batch: 15
vm stats threshold: 20
cpu: 1 pcp: 0
count: 177
high: 186
batch: 31
cpu: 1 pcp: 1
count: 12
high: 62
batch: 15
vm stats threshold: 20
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone. If
ZONE_DMA is optional then other zones may be first. So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat. However, we still need a zonetable to index all
the zones for each node if this is a NUMA system. Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65e458d4 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Drop get_zone_counts()

Values are available via ZVC sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9195481d 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Drop nr_free_pages_pgdat()

Function is unnecessary now. We can use the summing features of the ZVCs to
get the values we need.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96177299 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Drop free_pages()

nr_free_pages is now a simple access to a global variable. Make it a macro
instead of a function.

The nr_free_pages now requires vmstat.h to be included. There is one
occurrence in power management where we need to add the include. Directly
refrer to global_page_state() there to clarify why the #include was added.

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d23ad423 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Use ZVC for free_pages

This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8785385 10-Feb-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Use ZVC for inactive and active counts

The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.

However, not all pages in the system may be dirtied. Thus the ratio is always
too low and can never reach 100%. The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore. In that case we may get into
a situation in which f.e. the background writeback ratio of 40% cannot be
reached anymore which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty. These are the pages on the active and
the inactive list plus free pages.

The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up. This patchset makes these counters ZVC counters. This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive
page counts

- Cleanup of the functions that display information via the proc filesystem.

This patch:

The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations. More ZVC functionality is used for sums etc in the
following patches.

[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a6af2bc3 10-Feb-2007 Mel Gorman <mel@csn.ul.ie>

[PATCH] Avoid excessive sorting of early_node_map[]

find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call. This is an excessive amount of sorting and
that can be avoided. This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found. The
map is then only sorted once when required. Successfully boot tested on a
number of machines.

[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a25700a5 08-Feb-2007 Andrew Morton <akpm@osdl.org>

[PATCH] mm: show bounce pages in oom killer output

Also split that long line up - people like to send us wordwrapped oom-kill
traces.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6fd6b17c 31-Jan-2007 Linus Torvalds <torvalds@woody.linux-foundation.org>

Revert "[PATCH] mm: micro optimise zone_watermark_ok"

This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee.

Pawel Sikora had a boot-time oops due to it - because the sign change
invalidates the following comparisons, since 'free_pages' can be
negative.

The micro-optimization just isn't worth it.

Bisected-by: Pawel Sikora <pluto@agmk.net>
Acked-by: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2f3aa02 11-Jan-2007 Dave Hansen <haveblue@us.ibm.com>

[PATCH] Fix sparsemem on Cell

Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime. It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.

It has been compile tested on ppc64, ia64, s390, i386 and x86_64.

Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f2e12bb2 05-Jan-2007 Christoph Lameter <clameter@sgi.com>

[PATCH] Check for populated zone in __drain_pages

Both process_zones() and drain_node_pages() check for populated zones
before touching pagesets. However, __drain_pages does not do so,

This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.

Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets. Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.

If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.

If the cpu is then brought up again then we call process_zones() which will
ignore the unpopulated zone. So the pageset pointers will still be NULL.

If the cpu is then again brought down then __drain_pages will attempt to
drain pages by following the NULL pageset pointer for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9ab37b8f 05-Jan-2007 Paul Mundt <lethal@linux-sh.org>

[PATCH] Sanely size hash tables when using large base pages

At the moment the inode/dentry cache hash tables (common by way of
alloc_large_system_hash()) are incorrectly sized by their respective
detection logic when we attempt to use large base pages on systems with
little memory.

This results in odd behaviour when using a 64kB PAGE_SIZE, such as:

Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)

The mount cache hash table is seemingly the only one that gets this right
by directly taking PAGE_SIZE in to account.

The following patch attempts to catch the bogus values and round it up to
at least 0-order.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 02a0e53d 13-Dec-2006 Paul Jackson <pj@sgi.com>

[PATCH] cpuset: rework cpuset_zone_allowed api

Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6b1b60f4 08-Dec-2006 Don Mullis <dwm@meer.net>

[PATCH] fault-injection: defaults likely to please a new user

Assign defaults most likely to please a new user:
1) generate some logging output
(verbose=2)
2) avoid injecting failures likely to lock up UI
(ignore_gfp_wait=1, ignore_gfp_highmem=1)

Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 933e312e 08-Dec-2006 Akinobu Mita <akinobu.mita@gmail.com>

[PATCH] fault-injection capability for alloc_pages()

This patch provides fault-injection capability for alloc_pages()

Boot option:

fail_page_alloc=<interval>,<probability>,<space>,<times>

<interval> -- specifies the interval of failures.

<probability> -- specifies how often it should fail in percent.

<space> -- specifies the size of free space where memory can be
allocated safely in pages.

<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/fail_page_alloc/interval
/debug/fail_page_alloc/probability
/debug/fail_page_alloc/specifies
/debug/fail_page_alloc/times
/debug/fail_page_alloc/ignore-gfp-highmem
/debug/fail_page_alloc/ignore-gfp-wait

Example:

fail_page_alloc=10,100,0,-1

The page allocation (alloc_pages(), ...) fails once per 10 times.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f0d1b0b3 08-Dec-2006 David Howells <dhowells@redhat.com>

[PATCH] LOG2: Implement a general integer log2 facility in the kernel

This facility provides three entry points:

ilog2() Log base 2 of unsigned long
ilog2_u32() Log base 2 of u32
ilog2_u64() Log base 2 of u64

These facilities can either be used inside functions on dynamic data:

int do_something(long q)
{
...;
y = ilog2(x)
...;
}

Or can be used to statically initialise global variables with constant values:

unsigned n = ilog2(27);

When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant. They treat negative numbers as
unsigned.

When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.

[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 15ad7cdc 06-Dec-2006 Helge Deller <deller@gmx.de>

[PATCH] struct seq_operations and struct file_operations constification

- move some file_operations structs into the .rodata section

- move static strings from policy_types[] array into the .rodata section

- fix generic seq_operations usages, so that those structs may be defined
as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 02316067 06-Dec-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] hotplug CPU: clean up hotcpu_notifier() use

There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 04903664 06-Dec-2006 Andrew Morton <akpm@osdl.org>

[PATCH] remove HASH_HIGHMEM

It has no users and it's doubtful that we'll need it again.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 33f2ef89 06-Dec-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] mm: make compound page destructor handling explicit

Currently we we use the lru head link of the second page of a compound page
to hold its destructor. This was ok when it was purely an internal
implmentation detail. However, hugetlbfs overrides this destructor
violating the layering. Abstract this out as explicit calls, also
introduce a type for the callback function allowing them to be type
checked. For each callback we pre-declare the function, causing a type
error on definition rather than on use elsewhere.

[akpm@osdl.org: cleanups]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 952f3b51 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] GFP_THISNODE must not trigger global reclaim

The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node. If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node
then we will currently do some reclaim before returning NULL. The caller
may want memory from other nodes before reclaim should be triggered. (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ce421c79 06-Dec-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] mm: cleanup indentation on switch for CPU operations

These patches introduced new switch statements which are indented contrary
to the concensus in mm/*.c. Fix them up to match that concensus.

[PATCH] node local per-cpu-pages
[PATCH] ZVC: Scale thresholds depending on the size of the system
commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 25ba77c1 06-Dec-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int

NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long. This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.

In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:

1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
uses bit_set which in generic code takes an int.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bc4ba393 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] drain_node_page(): Drain pages in batch units

drain_node_pages() currently drains the complete pageset of all pages. If
there are a large number of pages in the queues then we may hold off
interrupts for too long.

Duplicate the method used in free_hot_cold_page. Only drain pcp->batch
pages at one time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b43a57bb 06-Dec-2006 Kirill Korotaev <dev@openvz.org>

[PATCH] OOM can panic due to processes stuck in __alloc_pages()

OOM can panic due to the processes stuck in __alloc_pages() doing infinite
rebalance loop while no memory can be reclaimed. OOM killer tries to kill
some processes, but unfortunetaly, rebalance label was moved by someone
below the TIF_MEMDIE check, so buddy allocator doesn't see that process is
OOM-killed and it can simply fail the allocation :/

Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
memory allocation tricks triggered OOM panic.

Signed-off-by: Denis Lunev <den@sw.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cc102509 06-Dec-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: add arch_alloc_page

Add an arch_alloc_page to match arch_free_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9276b1bc 06-Dec-2006 Paul Jackson <pj@sgi.com>

[PATCH] memory page_alloc zonelist caching speedup

Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second. And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

This differs in a significant way from a similar patch that I
posted a week ago. Now, instead of having a nodemask_t of
recently full nodes, I have a bitmask of recently full zones.
This solves a problem that last weeks patch had, which on
systems with multiple zones per node (such as DMA zone) would
take seeing any of these zones full as meaning that all zones
on that node were full.

Also I changed names - from "zonelist faster" to "zonelist cache",
as that seemed to better convey what we're doing here - caching
some of the key zonelist state (for faster access.)

See below for some performance benchmark results. After all that
discussion with David on why I didn't need them, I went and got
some ;). I wanted to verify that I had not hurt the normal case
of memory allocation noticeably. At least for my one little
microbenchmark, I found (1) the normal case wasn't affected, and
(2) workloads that forced scanning across multiple nodes for
memory improved up to 10% fewer System CPU cycles and lower
elapsed clock time ('sys' and 'real'). Good. See details, below.

I didn't have the logic in get_page_from_freelist() for various
full nodes and zone reclaim failures correct. That should be
fixed up now - notice the new goto labels zonelist_scan,
this_zone_full, and try_next_zone, in get_page_from_freelist().

There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
have seen real customer loads where the cost to scan the zonelist
was a problem, due to many nodes being full of memory before
we got to a node we could use. Or at least, I think we have.
This was related to me by another engineer, based on experiences
from some time past. So this is not guaranteed. Most likely, though.

The following approach should help such real numa systems just as
much as it helps fake numa systems, or any combination thereof.

2) The effort to distinguish fake from real numa, using node_distance,
so that we could cache a fake numa node and optimize choosing
it over equivalent distance fake nodes, while continuing to
properly scan all real nodes in distance order, was going to
require a nasty blob of zonelist and node distance munging.

The following approach has no new dependency on node distances or
zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

- See the use of "zlc_active" and "did_zlc_setup" below, to delay
adding any work for this new mechanism until we've looked at the
first zone in zonelist. I figured the odds of the first zone
having the memory we needed were high enough that we should just
look there, first, then get fancy only if we need to keep looking.

- Some odd hackery was needed to add items to struct zonelist, while
not tripping up the custom zonelists built by the mm/mempolicy.c
code for MPOL_BIND. My usual wordy comments below explain this.
Search for "MPOL_BIND".

- Some per-node data in the struct zonelist is now modified frequently,
with no locking. Multiple CPU cores on a node could hit and mangle
this data. The theory is that this is just performance hint data,
and the memory allocator will work just fine despite any such mangling.
The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
(a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
all be self correcting after at most a one second delay.

- This still does a linear scan of the same lengths as before. All
I've optimized is making the scan faster, not algorithmically
shorter. It is now able to scan a compact array of 'unsigned
short' in the case of many full nodes, so one cache line should
cover quite a few nodes, rather than each node hitting another
one or two new and distinct cache lines.

- If both Andi and Nick don't find this too complicated, I will be
(pleasantly) flabbergasted.

- I removed the comment claiming we only use one cachline's worth of
zonelist. We seem, at least in the fake numa case, to have put the
lie to that claim.

- I pay no attention to the various watermarks and such in this performance
hint. A node could be marked full for one watermark, and then skipped
over when searching for a page using a different watermark. I think
that's actually quite ok, as it will tend to slightly increase the
spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.) System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added. The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.

number 2.6.18-mm3 zonelist-cache delta (< 0 good) percent
GBs N ------------ -------------- ---------------- systime
mem threads sys user real sys user real sys user real better
12 1 153 24 177 151 24 176 -2 0 -1 1%
12 19 99 22 8 99 22 8 0 0 0 0%
12 37 111 25 6 112 25 6 1 0 0 -0%
12 55 115 25 5 110 23 5 -5 -2 0 4%
38 1 502 74 576 497 73 570 -5 -1 -6 0%
38 19 426 78 48 373 76 39 -53 -2 -9 12%
38 37 544 83 36 547 82 36 3 -1 0 -0%
38 55 501 77 23 511 80 24 10 3 1 -1%
64 1 917 125 1042 890 124 1014 -27 -1 -28 2%
64 19 1118 138 119 965 141 103 -153 3 -16 13%
64 37 1202 151 94 1136 150 81 -66 -1 -13 5%
64 55 1118 141 61 1072 140 58 -46 -1 -3 4%
90 1 1342 177 1519 1275 174 1450 -67 -3 -69 4%
90 19 2392 199 192 2116 189 176 -276 -10 -16 11%
90 37 3313 238 175 2972 225 145 -341 -13 -30 10%
90 55 1948 210 104 1843 213 100 -105 3 -4 5%

Notes:
1) This test ran a memory hog program that started a specified number N of
threads, and had each thread allocate and touch 1/N'th of
the total memory to be used in the test run in a single loop,
writing a constant word to memory, one store every 4096 bytes.
Watching this test during some earlier trial runs, I would see
each of these threads sit down on one CPU and stay there, for
the remainder of the pass, a different CPU for each thread.

2) The 'real' column is not comparable to the 'sys' or 'user' columns.
The 'real' column is seconds wall clock time elapsed, from beginning
to end of that test pass. The 'sys' and 'user' columns are total
CPU seconds spent on that test pass. For a 19 thread test run,
for example, the sum of 'sys' and 'user' could be up to 19 times the
number of 'real' elapsed wall clock seconds.

3) Tests were run on a fresh, single-user boot, to minimize the amount
of memory already in use at the start of the test, and to minimize
the amount of background activity that might interfere.

4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

5) Notice that the 'real' time gets large for the single thread runs, even
though the measured 'sys' and 'user' times are modest. I'm not sure what
that means - probably something to do with it being slow for one thread to
be accessing memory along ways away. Perhaps the fake numa system, running
ostensibly the same workload, would not show this substantial degradation
of 'real' time for one thread on many nodes -- lets hope not.

6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
ran quite efficiently, as one might expect. Each pair of threads needed
to allocate and touch the memory on the node the two threads shared, a
pleasantly parallizable workload.

7) The intermediate thread count passes, when asking for alot of memory forcing
them to go to a few neighboring nodes, improved the most with this zonelist
caching patch.

Conclusions:
* This zonelist cache patch probably makes little difference one way or the
other for most workloads on real numa hardware, if those workloads avoid
heavy off node allocations.
* For memory intensive workloads requiring substantial off-node allocations
on real numa hardware, this patch improves both kernel and elapsed timings
up to ten per-cent.
* For fake numa systems, I'm optimistic, but will have to leave that up to
Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 89689ae7 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Get rid of zone_table[]

The zone table is mostly not needed. If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.

In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field. In all of
the above cases there will be no need at all for the zone table.

The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags. In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.

For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table. The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0798e519 06-Dec-2006 Paul Jackson <pj@sgi.com>

[PATCH] memory page alloc minor cleanups

- s/freeliest/freelist/ spelling fix

- Check for NULL *z zone seems useless - even if it could happen, so
what? Perhaps we should have a check later on if we are faced with an
allocation request that is not allowed to fail - shouldn't that be a
serious kernel error, passing an empty zonelist with a mandate to not
fail?

- Initializing 'z' to zonelist->zones can wait until after the first
get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
loop, so let's initialize 'z' there, in a 'for' loop. Seems clearer.

- Remove superfluous braces around a break

- Fix a couple errant spaces

- Adjust indentation on the cpuset_zone_allowed() check, to match the
lines just before it -- seems easier to read in this case.

- Add another set of braces to the zone_watermark_ok logic

From: Paul Jackson <pj@sgi.com>

Backout one item from a previous "memory page_alloc minor cleanups" patch.
Until and unless we are certain that no one can ever pass an empty zonelist
to __alloc_pages(), this check for an empty zonelist (or some BUG
equivalent) is essential. The code in get_page_from_freelist() blow ups if
passed an empty zonelist.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1abbfb41 22-Nov-2006 Mel Gorman <mel@skynet.ie>

[PATCH] x86_64: fix bad page state in process 'swapper'

find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
depend on a sorted early_node_map[]. However, sort_node_map() is being
called after fin_min_pfn_with_active_regions() in
free_area_init_nodes().

In most cases, this is ok, but on at least one x86_64, the SRAT table
caused the E820 ranges to be registered out of order. This gave the
wrong values for the min PFN range resulting in some pages not being
initialised.

This patch sorts the early_node_map in find_min_pfn_for_node(). It has
been boot tested on x86, x86_64, ppc64 and ia64.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 941c7105 02-Nov-2006 nkalmala <nkalmala@gmail.com>

[PATCH] mm: un-needed add-store operation wastes a few bytes

Un-needed add-store operation wastes a few bytes.
8 bytes wasted with -O2, on a ppc.

Signed-off-by: nkalmala <nkalmala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0c6cb974 28-Oct-2006 Mel Gorman <mel@skynet.ie>

[PATCH] Calculation fix for memory holes beyong the end of physical memory

absent_pages_in_range() made the assumption that users of the
arch-independent zone-sizing API would not care about holes beyound the end
of physical memory. This was not the case and was "fixed" in a patch
called "Account for holes that are outside the range of physical memory".
However, when given a range that started before a hole in "real" memory and
ended beyond the end of memory, it would get the result wrong. The bug is
in mainline but a patch is below.

It has been tested successfully on a number of machines and architectures.
Additional credit to Keith Mannthey for discovering the problem, helping
identify the correct fix and confirming it Worked For Him.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3bb1a852 28-Oct-2006 Martin Bligh <mbligh@mbligh.org>

[PATCH] vmscan: Fix temp_priority race

The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.

The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently. In balance_pgdat, we keep a separate
priority record per zone in a local array. In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.

Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low. They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.

From: Andrew Morton <akpm@osdl.org>

__zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list. Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.

Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.

This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
possible to remove that now, and to just start out at DEF_PRIORITY?

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 75167957 21-Oct-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc

Reintroduce NODES_SPAN_OTHER_NODES for powerpc

Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
This reverts commit f62859bb6871c5e4a8e591c60befc8caaf54db8c.
Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
This reverts commit a94b3ab7eab4edcc9b2cb474b188f774c331adf7.

Also update the comments to indicate that this is still required
and where its used.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6220ec78 20-Oct-2006 Andrew Morton <akpm@osdl.org>

[PATCH] highest_possible_node_id() linkage fix

Qooting Adrian:

- net/sunrpc/svc.c uses highest_possible_node_id()

- include/linux/nodemask.h says highest_possible_node_id() is
out-of-line #if MAX_NUMNODES > 1

- the out-of-line highest_possible_node_id() is in lib/cpumask.c

- lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y

-> highest_possible_node_id() is used in net/sunrpc/svc.c
CONFIG_NODES_SHIFT defined and > 0

-> include/linux/numa.h: MAX_NUMNODES > 1

-> compile error

The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
depends on NUMA (but m32r isn't the only affected architecture).

So move the function into page_alloc.c

Cc: Adrian Bunk <bunk@stusta.de>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3fcfab16 20-Oct-2006 Andrew Morton <akpm@osdl.org>

[PATCH] separate bdi congestion functions from queue congestion functions

Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9858db50 11-Oct-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: locks_freed fix

Move the lock debug checks below the page reserved checks. Also, having
debug_check_no_locks_freed in kernel_map_pages is wrong.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dafb1367 11-Oct-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: arch_free_page fix

After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b888132b 11-Oct-2006 Mel Gorman <mel@skynet.ie>

[PATCH] mm: remove memmap_zone_idx()

memmap_zone_idx() is not used anymore. It was required by an earlier
version of
account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
any more.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 88ca3b94 04-Oct-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] page_alloc: fix kernel-doc and func. declaration

Fix kernel-doc and function declaration (missing "void") in
mm/page_alloc.c.

Add mm/page_alloc.c to kernel-api.tmpl in DocBook.

mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e80ee884 04-Oct-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: micro optimise zone_watermark_ok

Having min be a signed quantity means gcc can't turn high latency divides
into shifts. There happen to be two such divides for GFP_ATOMIC (ie.
networking, ie. important) allocations, one of which depends on the other.
Fixing this makes code smaller as a bonus.

Shame on somebody (probably me).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 423b41d7 27-Sep-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] mm/page_alloc: use NULL instead of 0 for ptr

Use NULL instead of 0 for pointer value, eliminate sparse warnings.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 66a55030 27-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Do not allocate pagesets for unpopulated zones.

We do not need to allocate pagesets for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d5f541ed 27-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Add node to zone for the NUMA case

Add the node in order to optimize zone_to_nid.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 08e0f6a9 27-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA

The NUMA_BUILD constant is always available and will be set to 1 on
NUMA_BUILDs. That way checks valid only under CONFIG_NUMA can easily be done
without #ifdef CONFIG_NUMA

F.e.

if (NUMA_BUILD && <numa_condition>) {
...
}

[akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
causing ifdef explosion in core kernel, so let's see if this is a comfortable
way in whcih to control that]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c7241913 27-Sep-2006 Jes Sorensen <jes@sgi.com>

[PATCH] Condense output of show_free_areas()

On larger systems, the amount of output dumped on the console when you do
SysRq-M is beyond insane. This patch is trying to reduce it somewhat as
even with the smaller NUMA systems that have hit the desktop this seems to
be a fair thing to do.

The philosophy I have taken is as follows:
1) If a zone is empty, don't tell, we don't need yet another line
telling us so. The information is available since one can look up
the fact how many zones were initialized in the first place.
2) Put as much information on a line is possible, if it can be done
in one line, rahter than two, then do it in one. I tried to format
the temperature stuff for easy reading.

Change show_free_areas() to not print lines for empty zones. If no zone
output is printed, the zone is empty. This reduces the number of lines
dumped to the console in sysrq on a large system by several thousand lines.

Change the zone temperature printouts to use one line per CPU instead of
two lines (one hot, one cold). On a 1024 CPU, 1024 node system, this
reduces the console output by over a million lines of output.

While this is a bigger problem on large NUMA systems, it is also applicable
to smaller desktop sized and mid range NUMA systems.

Old format:

Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 42, batch 7 used:24
cpu 0 cold: high 14, batch 3 used:1
cpu 1 hot: high 42, batch 7 used:34
cpu 1 cold: high 14, batch 3 used:0
cpu 2 hot: high 42, batch 7 used:0
cpu 2 cold: high 14, batch 3 used:0
cpu 3 hot: high 42, batch 7 used:0
cpu 3 cold: high 14, batch 3 used:0
cpu 4 hot: high 42, batch 7 used:0
cpu 4 cold: high 14, batch 3 used:0
cpu 5 hot: high 42, batch 7 used:0
cpu 5 cold: high 14, batch 3 used:0
cpu 6 hot: high 42, batch 7 used:0
cpu 6 cold: high 14, batch 3 used:0
cpu 7 hot: high 42, batch 7 used:0
cpu 7 cold: high 14, batch 3 used:0
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Node 1 DMA per-cpu:
[snip]
Free pages: 5410688kB (0kB HighMem)
Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
....

New format:

Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 42, btch: 7 usd: 41 Cold: hi: 14, btch: 3 usd: 2
CPU 1: Hot: hi: 42, btch: 7 usd: 40 Cold: hi: 14, btch: 3 usd: 1
CPU 2: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
CPU 3: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
CPU 4: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
CPU 5: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
CPU 6: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
CPU 7: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
Node 1 DMA per-cpu:
[snip]
Free pages: 5411088kB (0kB HighMem)
Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fb01439c 27-Sep-2006 Mel Gorman <mel@skynet.ie>

[PATCH] Allow an arch to expand node boundaries

Arch-independent zone-sizing determines the size of a node
(pgdat->node_spanned_pages) based on the physical memory that was
registered by the architecture. However, when
CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
spanned_pages will be much larger and that mem_map will be allocated that
is used lated on memory hot-add.

This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
to call push_node_boundaries() which will set the node beginning and end to
at *least* the requested boundary.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9c7cd687 27-Sep-2006 Mel Gorman <mel@skynet.ie>

[PATCH] Account for holes that are outside the range of physical memory

absent_pages_in_range() made the assumption that users of the API would not
care about holes beyound the end of physical memory. This was not the
case. This patch will account for ranges outside of physical memory as
holes correctly.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0e0b864e 27-Sep-2006 Mel Gorman <mel@csn.ul.ie>

[PATCH] Account for memmap and optionally the kernel image as holes

The x86_64 code accounted for memmap and some portions of the the DMA zone as
holes. This was because those areas would never be reclaimed and accounting
for them as memory affects min watermarks. This patch will account for the
memmap as a memory hole. Architectures may optionally use set_dma_reserve()
if they wish to account for a portion of memory in ZONE_DMA as a hole.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c713216d 27-Sep-2006 Mel Gorman <mel@csn.ul.ie>

[PATCH] Introduce mechanism for registering active regions of memory

At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate zone
sizes and holes in each architecture is very similar. Some of this zone and
hole sizing code is difficult to read for no good reason. This set of patches
eliminates the similar-looking architecture-specific code.

The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas have
been discovered, free_area_init_nodes() is called to initialise the pgdat and
zones. The zone sizes and holes are then calculated in an architecture
independent manner.

Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
It adjusts the watermarks slightly

Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine. These were on patches against 2.6.17-rc1 and release 3 of these
patches but there have been no ia64-changes since release 3.

There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts the kernel image and the starting mem_maps as memory holes
but the architecture-independent code accounts the memory as present.

The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy. There should be a
greater reduction when other architectures use the same mechanisms for zone
and hole sizing but I lack the hardware to test on.

Additional credit;
Dave Hansen for the initial suggestion and comments on early patches
Andy Whitcroft for reviewing early versions and catching numerous
errors
Tony Luck for testing and debugging on IA64
Bob Picco for fixing bugs related to pfn registration, reviewing a
number of patch revisions, providing a number of suggestions
on future direction and testing heavily
Jack Steiner and Robin Holt for testing on IA64 and clarifying
issues related to memory holes
Yasunori for testing on IA64
Andi Kleen for reviewing and feeding back about x86_64
Christian Kujau for providing valuable information related to ACPI
problems on x86_64 and testing potential fixes

This patch:

Define the structure to represent an active range of page frames within a node
in an architecture independent manner. Architectures are expected to register
active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
free_area_init_nodes() passing the PFNs of the end of each zone.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f623f0db 26-Sep-2006 Rafael J. Wysocki <rjw@rjwysocki.net>

[PATCH] swsusp: Fix mark_free_pages

Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
hack in kernel/power/snapshot.c#copy_data_pages().

Additionally, the page-copying loop in copy_data_pages() is moved to an
inline function.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 89fa3024 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] NUMA: Add zone_to_nid function

There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM. Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0ff38490 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zone_reclaim: dynamic slab reclaim

Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs. We have had a case where a machine with
46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
dealing with pagecache pages. However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim. Zone reclaim
occurs if there is a danger of an off node allocation. At that point we

1. Shrink the per node page cache if the number of pagecache
pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
(patch depends on earlier one that implements that counter)
are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific. So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit. I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 972d1a7b 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] ZVC: Support NR_SLAB_RECLAIMABLE / NR_SLAB_UNRECLAIMABLE

Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE. The
intend seems to be to check for slab pages that could be freed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8417bba4 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Replace min_unmapped_ratio by min_unmapped_pages in struct zone

*_pages is a better description of the role of the variable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 39bbcb8f 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] mm: do not check unpopulated zones for draining and counter updates

If a zone is unpopulated then we do not need to check for pages that are to
be drained and also not for vm counters that may need to be updated.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 006d22d9 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Optimize free_one_page

Free one_page currently adds the page to a fake list and calls
free_page_bulk. Fee_page_bulk takes it off again and then calles
__free_one_page.

Make free_one_page go directly to __free_one_page. Saves list on / off and
a temporary list in free_one_page for higher ordered pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1192d526 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Cleanup: Add zone pointer to get_page_from_freelist

There are frequent references to *z in get_page_from_freelist.

Add an explicit zone variable that can be used in all these places.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9b819d20 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions

Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
flag is essential if a kernel component requires memory to be located on a
certain node. It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 19655d34 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] linearly index zone->node_zonelists[]

I wonder why we need this bitmask indexing into zone->node_zonelists[]?

We always start with the highest zone and then include all lower zones
if we build zonelists.

Are there really cases where we need allocation from ZONE_DMA or
ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
of highest_zone() makes that already impossible.

If we go linear on the index then gfp_zone() == highest_zone() and a lot
of definitions fall by the wayside.

We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2f6726e5 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Apply type enum zone_type

After we have done this we can now do some typing cleanup.

The memory policy layer keeps a policy_zone that specifies
the zone that gets memory policies applied. This variable
can now be of type enum zone_type.

The check_highest_zone function and the build_zonelists funnctionm must
then also take a enum zone_type parameter.

Plus there are a number of loops over zones that also should use
zone_type.

We run into some troubles at some points with functions that need a
zone_type variable to become -1. Fix that up.

[pj@sgi.com: fix set_mempolicy() crash]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4e4785bc 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] mempolicies: fix policy_zone check

There is a check in zonelist_policy that compares pieces of the bitmap
obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
zonelist_policy().

The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
The policy_zone is a zone number with the possible values of ZONE_DMA,
ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
of values.

For some reason seemed to work before the zone reduction patchset (It
definitely works on SGI boxes since we just have one zone and the check
cannot fail).

With the zone reduction patchset this check definitely fails on systems
with two zones if the system actually has memory in both zones.

This is because ZONE_NORMAL is selected using no __GFP flag at
all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1.

policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
populated.

For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
allocations!

Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
to DMA allocations!

What we realy want in that place is to establish the highest allowable
zone for a given gfp_mask. If the highest zone is higher or equal to the
policy_zone then memory policies need to be applied. We have such
a highest_zone() function in page_alloc.c.

So move the highest_zone() function from mm/page_alloc.c into
include/linux/gfp.h. On the way we simplify the function and use the new
zone_type that was also introduced with the zone reduction patchset plus we
also specify the right type for the gfp flags parameter.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e53ef38d 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: make ZONE_HIGHMEM optional

Make ZONE_HIGHMEM optional

- ifdef out code and definitions related to CONFIG_HIGHMEM

- __GFP_HIGHMEM falls back to normal allocations if there is no
ZONE_HIGHMEM

- GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
zone.

[jdike@addtoit.com: build fix]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fb0e7942 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: make ZONE_DMA32 optional

Make ZONE_DMA32 optional

- Add #ifdefs around ZONE_DMA32 specific code and definitions.

- Add CONFIG_ZONE_DMA32 config option and use that for x86_64
that alone needs this zone.

- Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
for ia64 and fix up the way per node ZVCs are calculated.

- Fall back to prior GFP_ZONEMASK of 0x03 if there is no
DMA32 zone.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2f1b6248 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: use enum to define zones, reformat and comment

Use enum for zones and reformat zones dependent information

Add comments explaning the use of zones and add a zones_t type for zone
numbers.

Line up information that will be #ifdefd by the following patches.

[akpm@osdl.org: comment cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 98d2b0eb 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: page allocator ZONE_HIGHMEM cleanup

page allocator ZONE_HIGHMEM fixups

1. We do not need to do an #ifdef in si_meminfo since both counters
in use are zero if !CONFIG_HIGHMEM.

2. Add #ifdef in si_meminfo_node instead to avoid referencing zone
information for ZONE_HIGHMEM if we do not have HIGHMEM
(may not be there after the following patches).

3. Replace the use of ZONE_HIGHMEM with MAX_NR_ZONES in build_zonelists_node

4. build_zonelists_node: Remove BUG_ON for ZONE_HIGHMEM. Zone will
be optional soon and thus BUG_ON cannot be triggered anymore.

5. init_free_area_core: Replace a use of ZONE_HIGHMEM with NR_MAX_ZONES.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c1f60a5a 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: move HIGHMEM counters into highmem.c/.h

Move totalhigh_pages and nr_free_highpages() into highmem.c/.h

Move the totalhigh_pages definition into highmem.c/.h. Move the
nr_free_highpages function into highmem.c

[yoichi_yuasa@tripeaks.co.jp: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 182e8e23 26-Sep-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] reduce MAX_NR_ZONES: make display of highmem counters conditional on CONFIG_HIGHMEM

Do not display HIGHMEM memory sizes if CONFIG_HIGHMEM is not set.

Make HIGHMEM dependent texts and make display of highmem counters optional

Some texts are depending on CONFIG_HIGHMEM.

Remove those strings and remove the display of highmem counter values if
CONFIG_HIGHMEM is not set.

[akpm@osdl.org: remove some ifdefs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 725d704e 26-Sep-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: VM_BUG_ON

Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use this
in the lightweight, inline refcounting functions; PageLRU and PageActive
checks in vmscan, because they're pretty well confined to vmscan. And in
page allocate/free fastpaths which can be the hottest parts of the kernel
for kbuilds.

Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
side-effects, and should not be used outside core mm code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f3ef9ead 25-Sep-2006 David Rientjes <rientjes@cs.washington.edu>

[PATCH] do not free non slab allocated per_cpu_pageset

Stops panic associated with attempting to free a non slab-allocated
per_cpu_pageset.

Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f034b5d4 24-Aug-2006 David S. Miller <davem@sunset.davemloft.net>

[XFRM]: Dynamic xfrm_state hash table sizing.

The grow algorithm is simple, we grow if:

1) we see a hash chain collision at insert, and
2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
3) the number of xfrm_state objects is > the current hash mask

All of this needs some tweaking.

Remove __initdata from "hashdist" so we can use it safely at run time.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9614634f 03-Jul-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O

It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
pages. File I/O will use these pages for the next read/write requests and the
unmapped pages increase. After the zone has filled up again zone reclaim will
remove it again after only 32 pages. This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass. However. it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f8891e5e 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Light weight event counters

The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat. They have no
essential function for the VM.

We use a simple increment of per cpu variables. In order to avoid the most
severe races we disable preempt. Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter. However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.

In the non preempt case this results in a simple increment for each
counter. For many architectures this will be reduced by the compiler to a
single instruction. This single instruction is atomic for i386 and x86_64.
And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.

The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.

The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).

Benefits:
- VM event counter operations usually reduce to a single inline instruction
on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.

References:

RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ca889e6c 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Use Zoned VM Counters for NUMA statistics

The numa statistics are really event counters. But they are per node and
so we have had special treatment for these counters through additional
fields on the pcp structure. We can now use the per zone nature of the
zoned VM counters to realize these.

This will shrink the size of the pcp structure on NUMA systems. We will
have some room to add additional per zone counters that will all still fit
in the same cacheline.

Bits Prior pcp size Size after patch We can add
------------------------------------------------------------------
64 128 bytes (16 words) 80 bytes (10 words) 48
32 76 bytes (19 words) 56 bytes (14 words) 8 (64 byte cacheline)
72 (128 byte)

Remove the special statistics for numa and replace them with zoned vm
counters. This has the side effect that global sums of these events now
show up in /proc/vmstat.

Also take the opportunity to move the zone_statistics() function from
page_alloc.c into vmstat.c.

Discussions:
V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fd39fc85 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_unstable to per zone counter

Conversion of nr_unstable to a per zone counter

We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.

This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again. We are only left with event type counters
in page state.

[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ce866b34 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_writeback to per zone counter

Conversion of nr_writeback to per zone counter.

This removes the last page_state counter from arch/i386/mm/pgtable.c so we
drop the page_state from there.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b1e7a8fd 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_dirty to per zone counter

This makes nr_dirty a per zone counter. Looping over all processors is
avoided during writeback state determination.

The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones. Someone more familiar with
NFS should probably review what I have done.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# df849a15 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter

Conversion of nr_page_table_pages to a per zone counter

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9a865ffa 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_slab to per zone counter

- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
the slab. This may become useful to balance slab allocations over
various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 347ce434 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter

Currently a single atomic variable is used to establish the size of the page
cache in the whole machine. The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.

Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 65ba55f5 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: convert nr_mapped to per zone counter

nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.

We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).

We replace the use of nr_mapped in various kernel locations. This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2244b95a 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation

Per zone counter infrastructure

The counters that we currently have for the VM are split per processor. The
processor however has not much to do with the zone these pages belong to. We
cannot tell f.e. how many ZONE_DMA pages are dirty.

So we are blind to potentially inbalances in the usage of memory in various
zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
particular node. If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system. For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.

Another example is zone reclaim. We do not know how many unmapped pages exist
per zone. So we just have to try to reclaim. If it is not working then we
pause and try again later. It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone. This patchset allows the determination
of the number of unmapped pages per zone. We can remove the zone reclaim
interval with the counters introduced here.

Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.

The counter framework here implements differential counters for each processor
in struct zone. The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.

Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array. VM functions can
access the counts by simply indexing a global or zone specific array.

The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.

Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state. The second group of
functions can be called if it is known that interrupts are disabled.

Special optimized increment and decrement functions are provided. These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.

We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state(). In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f6ac2354 30-Jun-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h

NOTE: ZVC are *not* the lightweight event counters. ZVCs are reliable whereas
event counters do not need to be.

Zone based VM statistics are necessary to be able to determine what the state
of memory in one zone is. In a NUMA system this can be helpful for local
reclaim and other memory optimizations that may be able to shift VM load in
order to get more balanced memory use.

It is also useful to know how the computing load affects the memory
allocations on various zones. This patchset allows the retrieval of that data
from userspace.

The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.

Small per cpu 8 bit counters are added to struct zone. If the counter exceeds
certain thresholds then the counters are accumulated in an array of
atomic_long in the zone and in a global array that sums up all zone values.
The small 8 bit counters are next to the per cpu page pointers and so they
will be in high in the cpu cache when pages are allocated and freed.

Access to VM counter information for a zone and for the whole machine is then
possible by simply indexing an array (Thanks to Nick Piggin for pointing out
that approach). The access to the total number of pages of various types does
no longer require the summing up of all per cpu counters.

Benefits of this patchset right now:

- Ability for UP and SMP configuration to determine how memory
is balanced between the DMA, NORMAL and HIGHMEM zones.

- loops over all processors are avoided in writeback and
reclaim paths. We can avoid caching the writeback information
because the needed information is directly accessible.

- Special handling for nr_pagecache removed.

- zone_reclaim_interval vanishes since VM stats can now determine
when it is worth to do local reclaim.

- Fast inline per node page state determination.

- Accurate counters in /sys/devices/system/node/node*/meminfo. Current
counters are counting simply which processor allocated a page somewhere
and guestimate based on that. So the counters were not useful to show
the actual distribution of page use on a specific zone.

- The swap_prefetch patch requires per node statistics in order to
figure out when processors of a node can prefetch. This patch provides
some of the needed numbers.

- Detailed VM counters available in more /proc and /sys status files.

References to earlier discussions:
V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

Performance tests with AIM7 did not show any regressions. Seems to be a tad
faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
fixes for s390/arm/uml arch code.

This patch:

Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

Create vmstat.c/vmstat.h by separating the counter code and the proc
functions.

Move the vm_stat_text array before zoneinfo_show.

[akpm@osdl.org: s390 build fix]
[akpm@osdl.org: HOTPLUG_CPU build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# f9b8404c 27-Jun-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] pi-futex: introduce debug_check_no_locks_freed()

Add debug_check_no_locks_freed(), as a central inline to add
bad-lock-free-debugging functionality to.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 74b85f37 27-Jun-2006 Chandra Seetharaman <sekharan@us.ibm.com>

[PATCH] cpu hotplug: make cpu_notifier related notifier blocks __cpuinit only

Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9c7b216d 27-Jun-2006 Chandra Seetharaman <sekharan@us.ibm.com>

[PATCH] cpu hotplug: revert init patch submitted for 2.6.17

In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
band-aid solution to solve that problem. In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d6e05edc 26-Jun-2006 Andreas Mohr <andi@lisas.de>

spelling fixes

acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 43b0bc00 25-Jun-2006 Chris Wright <chrisw@sous-sol.org>

[PATCH] cpuset: remove extra cpuset_zone_allowed check in __alloc_pages

This is redundant with check in wakeup_kswapd.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8f9de51a 23-Jun-2006 Kirill Korotaev <dev@openvz.org>

[PATCH] printk() should not be called under zone->lock

This patch fixes printk() under zone->lock in show_free_areas(). It can be
unsafe to call printk() under this lock, since caller can try to
allocate/free some memory and selfdeadlock on this lock. I found
allocations/freeing mem both in netconsole and serial console.

This issue was faced in reallity when meminfo was periodically printed for
debug purposes and netconsole was used.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bd1e22b8 23-Jun-2006 Andrew Morton <akpm@osdl.org>

[PATCH] initialise total_memory() earlier

Initialise total_memory earlier in boot. Because if for some reason we run
page reclaim early in boot, we don't want total_memory to be zero when we use
it as a divisor.

And rename total_memory to vm_total_pages to avoid naming clashes with
architectures.

Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 67de6482 23-Jun-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] squash duplicate page_to_pfn and pfn_to_page

We have architectures where the size of page_to_pfn and pfn_to_page are
significant enough to overall image size that they wish to push them out of
line. However, in the process we have grown a second copy of the
implementation of each of these routines for each memory model. Share the
implmentation exposing it either inline or out-of-line as required.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6811378e 23-Jun-2006 Yasunori Goto <y-goto@jp.fujitsu.com>

[PATCH] wait_table and zonelist initializing for memory hotadd: update zonelists

In current code, zonelist is considered to be build once, no modification.
But MemoryHotplug can add new zone/pgdat. It must be updated.

This patch modifies build_all_zonelists(). By this, build_all_zonelist() can
reconfig pgdat's zonelists.

To update them safety, this patch use stop_machine_run(). Other cpus don't
touch among updating them by using it.

In old version (V2 of node hotadd), kernel updated them after zone
initialization. But present_page of its new zone is still 0, because
online_page() is not called yet at this time. Build_zonelists() checks
present_pages to find present zone. It was too early. So, I changed it after
online_pages().

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cca448fe 23-Jun-2006 Yasunori Goto <y-goto@jp.fujitsu.com>

[PATCH] wait_table and zonelist initializing for memory hotadd: wait_table initialization

Wait_table is initialized according to zone size at boot time. But, we cannot
know the maixmum zone size when memory hotplug is enabled. It can be
changed.... And resizing of wait_table is hard.

So kernel allocate and initialzie wait_table as its maximum size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 718127cc 23-Jun-2006 Yasunori Goto <y-goto@jp.fujitsu.com>

[PATCH] wait_table and zonelist initializing for memory hotadd: add return code for init_current_empty_zone

When add_zone() is called against empty zone (not populated zone), we have to
initialize the zone which didn't initialize at boot time. But,
init_currently_empty_zone() may fail due to allocation of wait table. So,
this patch is to catch its error code.

Changes against wait_table is in the next patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 86356ab1 23-Jun-2006 Yasunori Goto <y-goto@jp.fujitsu.com>

[PATCH] wait_table and zonelist initializing for memory hotadd: change to meminit for build_zonelist

Change definitions of some functions and data from __init to __meminit.

These functions and data can be used after bootup by this patch to be used for
hot-add codes.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 02b694de 23-Jun-2006 Yasunori Goto <y-goto@jp.fujitsu.com>

[PATCH] wait_table and zonelist initializing for memory hotadd: change name of wait_table_size()

This is just to rename from wait_table_size() to wait_table_hash_nr_entries().

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cb2b95e1 23-Jun-2006 Andy Whitcroft <apw@shadowen.org>

[PATCH] zone handle unaligned zone boundaries

The buddy allocator has a requirement that boundaries between contigious
zones occur aligned with the the MAX_ORDER ranges. Where they do not we
will incorrectly merge pages cross zone boundaries. This can lead to pages
from the wrong zone being handed out.

Originally the buddy allocator would check that buddies were in the same
zone by referencing the zone start and end page frame numbers. This was
removed as it became very expensive and the buddy allocator already made
the assumption that zones boundaries were aligned.

It is clear that not all configurations and architectures are honouring
this alignment requirement. Therefore it seems safest to reintroduce
support for non-aligned zone boundaries. This patch introduces a new check
when considering a page a buddy it compares the zone_table index for the
two pages and refuses to merge the pages where they do not match. The
zone_table index is unique for each node/zone combination when
FLATMEM/DISCONTIGMEM is enabled and for each section/zone combination when
SPARSEMEM is enabled (a SPARSEMEM section is at least a MAX_ORDER size).

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e984bb43 20-May-2006 Bob Picco <bob.picco@hp.com>

[PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary

Andy added code to buddy allocator which does not require the zone's
endpoints to be aligned to MAX_ORDER. An issue is that the buddy allocator
requires the node_mem_map's endpoints to be MAX_ORDER aligned. Otherwise
__page_find_buddy could compute a buddy not in node_mem_map for partial
MAX_ORDER regions at zone's endpoints. page_is_buddy will detect that
these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
allocator and not part of zone). Of course the negative here is we could
waste a little memory but the positive is eliminating all the old checks
for zone boundary conditions.

SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
when SPARSEMEM is configured. ia64 VIRTUAL_MEM_MAP doesn't need the logic
either because the holes and endpoints are handled differently. This
leaves checking alloc_remap and other arches which privately allocate for
node_mem_map.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bdd804f4 20-May-2006 Paul Jackson <pj@sgi.com>

[PATCH] Cpuset: might sleep checking zones allowed fix

Fix a couple of infrequently encountered 'sleeping function called from
invalid context' in the cpuset hooks in __alloc_pages. Could sleep while
interrupts disabled.

The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
__alloc_pages() to determine if a zone is allowed in the current tasks
cpuset. This routine can sleep, for certain GFP_KERNEL allocations, if the
zone is on a memory node not allowed in the current cpuset, but might be
allowed in a parent cpuset.

But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

The rule was intended to be:
Don't call cpuset_zone_allowed() if you can't sleep, unless you
pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
the code that might scan up ancestor cpusets and sleep.

This rule was being violated in a couple of places, due to a bogus change
made (by myself, pj) to __alloc_pages() as part of the November 2005 effort
to cleanup its logic, and also due to a later fix to constrain which swap
daemons were awoken.

The bogus change can be seen at:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
[PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

This was first noticed on a tight memory system, in code that was disabling
interrupts and doing allocation requests with __GFP_WAIT not set, which
resulted in __might_sleep() writing complaints to the log "Debug: sleeping
function called ...", when the code in cpuset_zone_allowed() tried to take
the callback_sem cpuset semaphore.

We haven't seen a system hang on this 'might_sleep' yet, but we are at
decent risk of seeing it fairly soon, especially since the additional
cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
March 2006.

Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
to Nick Piggin who warned me of this back in Nov 2005, before I was ready
to listen.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ac924c60 15-May-2006 Andrew Morton <akpm@osdl.org>

[PATCH] setup_per_zone_pages_min() overflow fix

As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this
function can experience overflows on 32-bit machines, causing our response to
changed values of min_free_kbytes to go whacky.

Fixing it efficiently is all too hard, so fix it with 64-bit math instead.

Cc: Ake Sandgren <ake.sandgren@hpc2n.umu.se>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 83d722f7 24-Apr-2006 Chandra Seetharaman <sekharan@us.ibm.com>

[PATCH] Remove __devinit and __cpuinit from notifier_call definitions

Few of the notifier_chain_register() callers use __init in the definition
of notifier_call. It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).

This patch fixes all such usages to _not_ have the notifier_call __init
section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6aa3001b2 18-Apr-2006 Andrew Morton <akpm@osdl.org>

[PATCH] page_alloc.c: buddy handling cleanup

Fix up some whitespace damage.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cb45b0e9 10-Apr-2006 Hideo AOKI <haoki@redhat.com>

[PATCH] overcommit: add calculate_totalreserve_pages()

These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
__vm_enough_memory().

- why the kernel needed patching

When the kernel can't allocate anonymous pages in practice, currnet
OVERCOMMIT_GUESS could return success. This implementation might be
the cause of oom kill in memory pressure situation.

If the Linux runs with page reservation features like
/proc/sys/vm/lowmem_reserve_ratio and without swap region, I think
the oom kill occurs easily.

- the overall design approach in the patch

When the OVERCOMMET_GUESS algorithm calculates number of free pages,
the reserved free pages are regarded as non-free pages.

This change helps to avoid the pitfall that the number of free pages
become less than the number which the kernel tries to keep free.

- testing results

I tested the patches using my test kernel module.

If the patches aren't applied to the kernel, __vm_enough_memory()
returns success in the situation but autual page allocation is
failed.

On the other hand, if the patches are applied to the kernel, memory
allocation failure is avoided since __vm_enough_memory() returns
failure in the situation.

I checked that on i386 SMP 16GB memory machine. I haven't tested on
nommu environment currently.

This patch adds totalreserve_pages for __vm_enough_memory().

Calculate_totalreserve_pages() checks maximum lowmem_reserve pages and
pages_high in each zone. Finally, the function stores the sum of each
zone to totalreserve_pages.

The totalreserve_pages is calculated when the VM is initilized.
And the variable is updated when /proc/sys/vm/lowmem_reserve_raito
or /proc/sys/vm/min_free_kbytes are changed.

Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 676165a8 09-Apr-2006 Nick Piggin <piggin@cyberone.com.au>

[PATCH] Fix buddy list race that could lead to page lru list corruptions

Rohit found an obscure bug causing buddy list corruption.

page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
to determine whether or not a free page's buddy is itself free and in the
buddy lists.

Each of the conjuncts may be true at different times due to unrelated
conditions, so the non-atomic page_is_buddy test may find each conjunct to
be true even if they were not both true at the same time (ie. the page was
not on the buddy lists).

Signed-off-by: Martin Bligh <mbligh@google.com>
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ae0f15fb 27-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_online_pgdat: remove pgdat_list

By using for_each_online_pgdat(), pgdat_list is not necessary now. This patch
removes it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ec936fc5 27-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_online_pgdat: renaming for_each_pgdat

Replace for_each_pgdat() with for_each_online_pgdat().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a0140c1d 27-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] remove zone_mem_map

This patch removes zone_mem_map.

pfn_to_page uses pgdat, page_to_pfn uses zone. page_to_pfn can use pgdat
instead of zone, which is only one user of zone_mem_map. By modifing it,
we can remove zone_mem_map.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a117e66e 27-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] unify pfn_to_page: generic functions

There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM.
Each arch has its own page_to_pfn(), pfn_to_page() for each models.
But most of them can use the same arithmetic.

This patch adds asm-generic/memory_model.h, which includes generic
page_to_pfn(), pfn_to_page() definitions for each memory model.

When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
used instead of macro. This is enabled by some archs and reduces
text size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6e692ed3 25-Mar-2006 John Hawkes <hawkes@sgi.com>

[PATCH] fix alloc_large_system_hash() roundup

The "rounded up to nearest power of 2 in size" algorithm in
alloc_large_system_hash is not correct. As coded, it takes an otherwise
acceptable power-of-2 value and doubles it. For example, we see the error
if we boot with thash_entries=2097152 which produces a hash table with
4194304 entries.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f5335c0f 25-Mar-2006 Anton Blanchard <anton@samba.org>

[PATCH] quieten zone_pcp_init

In zone_pcp_init we print out all zones even if they are empty:

On node 0 totalpages: 245760
DMA zone: 245760 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0

To conserve dmesg space why not print only the non zero zones.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0b1303fc 24-Mar-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] cpusets: only wakeup kswapd for zones in the current cpuset

If we get under some memory pressure in a cpuset (we only scan zones that
are in the cpuset for memory) then kswapd is woken up for all zones. This
patch only wakes up kswapd in zones that are part of the current cpuset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 879336c3 22-Mar-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] drain_node_pages: interrupt latency reduction / optimization

1. Only disable interrupts if there is actually something to free

2. Only dirty the pcp cacheline if we actually freed something.

3. Disable interrupts for each single pcp and not for cleaning
all the pcps in all zones of a node.

drain_node_pages is called every 2 seconds from cache_reap. This
fix should avoid most disabling of interrupts.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6626c5d5 22-Mar-2006 Andrew Morton <akpm@osdl.org>

[PATCH] mm: prep_zero_page() in irq is a bug

prep_zero_page() uses KM_USER0 and hence may not be used from IRQ context, at
least for highmem pages.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 17cf4406 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: cleanup prep_ stuff

Move the prep_ stuff into prep_new_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7835e98b 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] remove set_page_count() outside mm/

set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().

This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 84097518 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: nommu use compound pages

Now that compound page handling is properly fixed in the VM, move nommu
over to using compound pages rather than rolling their own refcounting.

nommu vm page refcounting is broken anyway, but there is no need to have
divergent code in the core VM now, nor when it gets fixed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: David Howells <dhowells@redhat.com>

(Needs testing, please).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 545b1ea9 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: cleanup bootmem

The bootmem code added to page_alloc.c duplicated some page freeing code
that it really doesn't need to because it is not so performance critical.

While we're here, make prefetching work properly by actually prefetching
the page we're about to use before prefetching ahead to the next one (ie.
get the most important transaction started first). Also prefetch just a
single page ahead rather than leaving a gap of 16.

Jack Steiner reported no problems with SGI's ia64 simulator.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8dfcc9ba 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: split highorder pages

Have an explicit mm call to split higher order pages into individual pages.
Should help to avoid bugs and be more explicit about the code's intention.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5e9dace8 22-Mar-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: page_alloc less atomics

More atomic operation removal from page allocator

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b40607fc 22-Mar-2006 Andrew Morton <akpm@osdl.org>

[PATCH] __get_page_state() cpumask cleanup and fix

__get_page_state() has an open-coded for_each_cpu_mask() loop in it.

Tidy that up, then notice that the code was buggy:

while (cpu < NR_CPUS) {
unsigned long *in, *out, off;

if (!cpu_isset(cpu, *cpumask))
continue;

an obvious infinite loop. I guess we just never call it with a holey cpu
mask.

Even after my cpumask size-reduction work, this patch increases code size :(

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8fce4d8e 09-Mar-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages.

The cache reaper currently tries to free all alien caches and all remote
per cpu pages in each pass of cache_reap. For a machines with large number
of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
Interrupts are disabled while reclaiming creating unacceptable delays.

This patch changes that behavior by adding a per cpu reap_node variable.
Instead of attempting to free all caches, we free only one alien cache and
the per cpu pages from one remote node. That reduces the time spend in
cache_reap. However, doing so will lengthen the time it takes to
completely drain all remote per cpu pagesets and all alien caches. The
time needed will grow with the number of nodes in the system. All caches
are drained when they overflow their respective capacity. So the drawback
here is only that a bit of memory may be wasted for awhile longer.

Details:

1. Rename drain_remote_pages to drain_node_pages to allow the specification
of the node to drain of pcp pages.

2. Add additional functions init_reap_node, next_reap_node for NUMA
that manage a per cpu reap_node counter.

3. Add a reap_alien function that reaps only from the current reap_node.

For us this seems to be a critical issue. Holdoffs of an average of ~7ms
cause some HPC benchmarks to slow down significantly. F.e. NAS parallel
slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor
compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with
the additional interrupt holdoff reductions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9b0f8b04 20-Feb-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely going to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is
similar to what we do when running out of hugepages.

This patch adds a check before we kill processes. At that point
performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should kill the
current process.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4cf808eb 17-Feb-2006 Linus Torvalds <torvalds@osdl.org>

[PATCH] Handle holes in node mask in node fallback list setup

Change the find_next_best_node algorithm to correctly skip
over holes in the node online mask. Previously it would not handle
missing nodes correctly and cause crashes at boot.

[Written by Linus, tested by AK]

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d98c7a09 14-Feb-2006 Hugh Dickins <hugh@veritas.com>

[PATCH] compound page: default destructor

Somehow I imagined that calling a NULL destructor would free a compound page
rather than oopsing. No, we must supply a default destructor, __free_pages_ok
using the order noted by prep_compound_page. hugetlb can still replace this
as before with its own free_huge_page pointer.

The case that needs this is not common: rarely does put_compound_page's
put_page_testzero bring the count down to 0. But if get_user_pages is applied
to some part of a compound page, without immediate release (e.g. AIO or
Infiniband), then it's possible for its put_page to come after the containing
vma has been unmapped and the driver done its free_pages.

That's just the kind of case compound pages are supposed to be guarding
against (but Nick points out, nor did PageReserved handle this right).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 41d78ba5 14-Feb-2006 Hugh Dickins <hugh@veritas.com>

[PATCH] compound page: use page[1].lru

If a compound page has its own put_page_testzero destructor (the only current
example is free_huge_page), that is noted in page[1].mapping of the compound
page. But that's rather a poor place to keep it: functions which call
set_page_dirty_lock after get_user_pages (e.g. Infiniband's
__ib_umem_release) ought to be checking first, otherwise set_page_dirty is
liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page
needs a destructor, so we can no longer rely on hugetlb pages going their own
special way, to avoid further problems of page->mapping reuse. For example,
not many people know that: on 50% of i386 -Os builds, the first tail page of a
compound page purports to be PageAnon (when its destructor has an odd
address), which surprises page_add_file_rmap.

Keep the compound page destructor in page[1].lru.next instead. And to free up
the common pairing of mapping and index, also move compound page order from
index to lru.prev. Slab reuses page->lru too: but if we ever need slab to use
compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now
have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
caller to check that they're not compund pages before doing the dirty).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 88a2a4ac 05-Feb-2006 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] percpu data: only iterate over possible CPUs

percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h. powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6292d9aa 01-Feb-2006 Ashok Raj <ashok.raj@intel.com>

[PATCH] __cpuinit functions wrongly marked __meminit

__meminit has overzelously been modified and crept its way into marking
cpuup callbacks as __meminit.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9eeff239 18-Jan-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] Zone reclaim: Reclaim logic

Some bits for zone reclaim exists in 2.6.15 but they are not usable. This
patch fixes them up, removes unused code and makes zone reclaim usable.

Zone reclaim allows the reclaiming of pages from a zone if the number of
free pages falls below the watermarks even if other zones still have enough
pages available. Zone reclaim is of particular importance for NUMA
machines. It can be more beneficial to reclaim a page than taking the
performance penalties that come with allocating a page on a remote zone.

Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch. By default
RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same
component (enclosure or motherboard) on IA64. The meaning of the NUMA
distance information seems to vary by arch.

If zone reclaim is not successful then no further reclaim attempts will
occur for a certain time period (ZONE_RECLAIM_INTERVAL).

This patch was discussed before. See

http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c09b4240 16-Jan-2006 Matt Tolentino <metolent@cs.vt.edu>

[PATCH] x86_64: add __meminit for memory hotplug

Add __meminit to the __init lineup to ensure functions default
to __init when memory hotplug is not enabled. Replace __devinit
with __meminit on functions that were changed when the memory
hotplug code was introduced.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cbe8dd4a 12-Jan-2006 Greg Ungerer <gerg@snapgear.com>

[PATCH] memmap_init_zone(): remove uneccesary page++

Remove unecessary page++ from memmap_init_zone loop.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4eac915d 11-Jan-2006 Paul Jackson <pj@sgi.com>

[PATCH] mm: gfp_atomic comments

Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use
emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7365f3d1 11-Jan-2006 Hugh Dickins <hugh@veritas.com>

[PATCH] Restore KERN_EMERG to each line printed by bad_page

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a4fc7ab1 11-Jan-2006 David Woodhouse <dwmw2@infradead.org>

[PATCH] fix/simplify mutex debugging code

Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
arguments instead, since all its callers were just calculating the 'to'
address for themselves anyway... (and sometimes doing so badly).

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# de5097c2 09-Jan-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] mutex subsystem, more debugging code

more mutex debugging: check for held locks during memory freeing,
task exit, enable sysrq printouts, etc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>


# 3e0d98b9 08-Jan-2006 Paul Jackson <pj@sgi.com>

[PATCH] cpuset: memory pressure meter

Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.

This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.

==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.

Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.

Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 48db57f8 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: free_pages opt

Try to streamline free_pages_bulk by ensuring callers don't pass in a
'count' that exceeds the list size.

Some cleanups:
Rename __free_pages_bulk to __free_one_page.
Put the page list manipulation from __free_pages_ok into free_one_page.
Make __free_pages_ok static.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 23316bc8 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: cleanup zone_pcp

Use zone_pcp everywhere even though NUMA code "knows" the internal details
of the zone. Stop other people trying to copy, and it looks nicer.

Also, only print the pagesets of online cpus in zoneinfo.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8ad4b1fb 08-Jan-2006 Rohit Seth <rohit.seth@intel.com>

[PATCH] Make high and batch sizes of per_cpu_pagelists configurable

As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists. This patch makes these two
variables configurable through /proc interface.

A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list. The min value for this is 8. It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.

The batch value of each per cpu pagelist is also updated as a result. It
is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bec6b0c8 08-Jan-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] slab: remove nested #ifdef CONFIG_NUMA

For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
CONFIG_NUMA in the page allocator. Remove innermost #ifdef CONFIG_NUMA

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 84c2008a 08-Jan-2006 Andrew Morton <akpm@osdl.org>

[PATCH] revert "mm: page_state fixes"

Hugh says:

page_alloc_cpu_notify() specifically contains code to

/* Add dead cpu's page_states to our own. */

which handles this more efficiently.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a74609fa 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: page_state opt

Optimise page_state manipulations by introducing interrupt unsafe accessors
to page_state fields. Callers must provide their own locking (either
disable interrupts or not update from interrupt context).

Switch over the hot callsites that can easily be moved under interrupts off
sections.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 070f8032 06-Jan-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] build_zonelists_node(): rename args

Give j and r meaningful names.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 02a68a5e 06-Jan-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] Fix zone policy determination

The use k in the inner loop means that the highest zone nr is always used
if any zone of a node is populated. This means that the policy zone is not
correctly determined on arches that do no use HIGHMEM like ia64.

Change the loop to decrement k which also simplifies the BUG_ON.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4be38e35 06-Jan-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] mm: move determination of policy_zone into page allocator

Currently the function to build a zonelist for a BIND policy has the side
effect to set the policy_zone. This seems to be a bit strange. policy
zone seems to not be initialized elsewhere and therefore 0. Do we police
ZONE_DMA if no bind policy has been used yet?

This patch moves the determination of the zone to apply policies to into
the page allocator. We determine the zone while building the zonelist for
nodes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1a93205b 06-Jan-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] mm: simplify build_zonelists_node by removing the case statement.

Simplify build_zonelists_node by removing the case statement.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f3fe6512 06-Jan-2006 Con Kolivas <kernel@kolivas.org>

[PATCH] mm: add populated_zone() helper

There are numerous places we check whether a zone is populated or not.

Provide a helper function to check for populated zones and convert all
checks for zone->present_pages.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 224abf92 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: bad_page optimisation

Cut down size slightly by not passing bad_page the function name (it should be
able to be determined by dump_stack()). And cut down the number of printks in
bad_page.

Also, cut down some branching in the destroy_compound_page path.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9328b8fa 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: dma32 zone statistics

Add dma32 to zone statistics. Also attempt to arrange struct page_state a
bit better (visually).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a226f6c8 06-Jan-2006 David Howells <dhowells@redhat.com>

[PATCH] FRV: Clean up bootmem allocator's page freeing algorithm

The attached patch cleans up the way the bootmem allocator frees pages.

A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is
called from mm/bootmem.c to turn pages over to the main allocator. All the
bits of code to initialise pages (clearing PG_reserved and setting the page
count) are moved to here. The checks on page validity are removed, on the
assumption that the struct page arrays will have been prepared correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 085cc7d5 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: page_alloc cleanups

Small cleanups that does not change generated code with the gcc's I've tested
with.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a86b1f53 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: page_state fixes

read_page_state and __get_page_state only traverse online CPUs, which will
cause results to fluctuate when CPUs are plugged in or out.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2d92c5c9 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: remove pcp low

struct per_cpu_pages.low is useless. Remove it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 13e7444b 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: remove bad_range

bad_range is supposed to be a temporary check. It would be a pity to throw it
out. Make it depend on CONFIG_DEBUG_VM instead.

CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the
page allocator. Add that to page_is_buddy instead.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 92be2e33 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: microopt conditions

Micro optimise some conditionals where we don't need lazy evaluation.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 77a8a788 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: set_page_refs opt

Inline set_page_refs.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c54ad30c 06-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: pagealloc opt

Slightly optimise some page allocation and freeing functions by taking
advantage of knowing whether or not interrupts are disabled.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a94b3ab7 06-Jan-2006 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES

The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM
could handle pSeries numa layouts. However, support for DISCONTIGMEM has
been replaced by SPARSEMEM on powerpc. As a result, this config option and
supporting code is no longer needed.

I have already sent a patch to Paul that removes the option from powerpc
specific code. This removes the arch independent piece. Doesn't really
matter which is applied first.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 47f3a867 06-Jan-2006 Paul Jackson <pj@sgi.com>

[PATCH] mm: fix __alloc_pages cpuset ALLOC_* flags

Two changes to the setting of the ALLOC_CPUSET flag in
mm/page_alloc.c:__alloc_pages()

- A bug fix - the "ignoring mins" case should not be honoring ALLOC_CPUSET.
This case of all cases, since it is handling a request that will free up
more memory than is asked for (exiting tasks, e.g.) should be allowed to
escape cpuset constraints when memory is tight.

- A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC
(!wait) requests. With this, cpuset confinement applies to all requests
except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can
remove the ALLOC_CPUSET flag entirely. Since I don't know any real reason
this logic has to be either way, I am choosing the path of the simplest
code.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 78d9955b 15-Dec-2005 Al Viro <viro@ftp.linux.org.uk>

[PATCH] missing prototype (mm/page_alloc.c)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0ceaacc9 03-Dec-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] Fix up per-cpu page batch sizes

The code to clamp batch sizes to 2^n - 1 went missing and an extra
check got added, which must have been a hunk of the "higer order pcp
batch refills" work sneaking in.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3148890b 28-Nov-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: __alloc_pages cleanup fix

I believe this patch is required to fix breakage in the asynch reclaim
watermark logic introduced by this patch:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee

Just some background of the watermark logic in case it isn't clear...
Basically what we have is this:

--- pages_high
|
| (a)
|
--- pages_low
|
| (b)
|
--- pages_min
|
| (c)
|
--- 0

Now when pages_low is reached, we want to kick asynch reclaim, which gives us
an interval of "b" before we must start synch reclaim, and gives kswapd an
interval of "a" before it need go back to sleep.

When pages_min is reached, normal allocators must enter synch reclaim, but
PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive
allocations, etc.) get access to varying amounts of the reserve "c".

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 689bcebf 21-Nov-2005 Hugh Dickins <hugh@veritas.com>

[PATCH] unpaged: PG_reserved bad_page

It used to be the case that PG_reserved pages were silently never freed, but
in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should
work through such cases as they appear, fixing the code; but for now it's
safer to issue the message without freeing the page, leaving PG_reserved set.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 664beed0 21-Nov-2005 Hugh Dickins <hugh@veritas.com>

[PATCH] unpaged: unifdefed PageCompound

It looks like snd_xxx is not the only nopage to be using PageReserved as a way
of holding a high-order page together: which no longer works, but is masked by
our failure to free from VM_RESERVED areas. We cannot fix that bug without
first substituting another way to hold the high-order page together, while
farming out the 0-order pages from within it.

That's just what PageCompound is designed for, but it's been kept under
CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out- of-line
put_page), doesn't slow down what most needs to be fast (already using
hugetlb), and unifies the way we handle high-order pages.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6b1de916 17-Nov-2005 Jens Axboe <axboe@suse.de>

[PATCH] VM: fix zone list restart in page allocatate

We must reassign z before looping through the zones kicking kswapd,
since it will be NULL if we hit an OOM condition and jump back to the
beginning again. 'z' is initially assigned before the restart: label. So
move the restart label up a little.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 07808b74 05-Nov-2005 Andi Kleen <ak@linux.intel.com>

[PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_t

Has been introduced for x86-64 at some point to save memory
in struct page, but has been obsolete for some time. Just
remove it.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b0d41693 05-Nov-2005 Andi Kleen <ak@linux.intel.com>

[PATCH] x86_64: When cpu_up fails clean up page allocator properly

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a2f1b424 05-Nov-2005 Andi Kleen <ak@linux.intel.com>

[PATCH] x86_64: Add 4GB DMA32 zone

Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.

As a bit of historical background: when the x86-64 port
was originally designed we had some discussion if we should
use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or
both. Both was ruled out at this point because it was in early
2.4 when VM is still quite shakey and had bad troubles even
dealing with one DMA zone. We settled on the 16MB DMA zone mainly
because we worried about older soundcards and the floppy.

But this has always caused problems since then because
device drivers had trouble getting enough DMA able memory. These days
the VM works much better and the wide use of NUMA has proven
it can deal with many zones successfully.

So this patch adds both zones.

This helps drivers who need a lot of memory below 4GB because
their hardware is not accessing more (graphic drivers - proprietary
and free ones, video frame buffer drivers, sound drivers etc.).
Previously they could only use IOMMU+16MB GFP_DMA, which
was not enough memory.

Another common problem is that hardware who has full memory
addressing for >4GB misses it for some control structures in memory
(like transmit rings or other metadata). They tended to allocate memory
in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent,
but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory
(even on AMD systems the IOMMU tends to be quite small) especially if you have
many devices. With the new zone pci_alloc_consistent can just put
this stuff into memory below 4GB which works better.

One argument was still if the zone should be 4GB or 2GB. The main
motivation for 2GB would be an unnamed not so unpopular hardware
raid controller (mostly found in older machines from a particular four letter
company) who has a strange 2GB restriction in firmware. But
that one works ok with swiotlb/IOMMU anyways, so it doesn't really
need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
it seems to be the most common restriction.

The new zone is so far added only for x86-64.

For other architectures who don't set up this
new zone nothing changes. Architectures can set a compatibility
define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32
as GFP_DMA. Otherwise it's a nop because on 32bit architectures
it's normally not needed because GFP_NORMAL (=0) is DMA able
enough.

One problem is still that GFP_DMA means different things on different
architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA
(trusting it to be 4GB) #elif __x86_64__ (use other hacks like
the swiotlb because 16MB is not enough) ... . This was quite
ugly and is now obsolete.

These should be now converted to use GFP_DMA32 unconditionally. I haven't done
this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent
which will use GFP_DMA32 transparently.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 669ed175 13-Nov-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: highmem watermarks

The pages_high - pages_low and pages_low - pages_min deltas are the asynch
reclaim watermarks. As such, the should be in the same ratios as any other
zone for highmem zones. It is the pages_min - 0 delta which is the
PF_MEMALLOC reserve, and this is the region that isn't very useful for
highmem.

This patch ensures highmem systems have similar characteristics as non highmem
ones with the same amount of memory, and also that highmem zones get similar
reclaim pressures to other zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7fb1d9fc 13-Nov-2005 Rohit Seth <rohit.seth@intel.com>

[PATCH] mm: __alloc_pages cleanup

Clean up of __alloc_pages.

Restoration of previous behaviour, plus further cleanups by introducing an
'alloc_flags', removing the last of should_reclaim_zone.

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 885036d3 13-Nov-2005 Kirill Korotaev <dev@sw.ru>

[PATCH] mm: __GFP_NOFAIL fix

In __alloc_pages():

if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];

page = buffered_rmqueue(z, order, gfp_mask);
if (page) {
zone_statistics(zonelist, z);
goto got_pg;
}
}
goto nopage; <<<< HERE!!! FAIL...
}

kswapd (which has PF_MEMALLOC flag) can fail to allocate memory even when
it allocates it with __GFP_NOFAIL flag.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6b482c67 10-Nov-2005 Dave Jones <davej@redhat.com>

[PATCH] Don't print per-cpu vm stats for offline cpus.

I just hit a page allocation error on a kernel configured to support
64 CPUs. It spewed 60 completely useless unnecessary lines of info.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f8b8db77 07-Nov-2005 Adrian Bunk <bunk@stusta.de>

[PATCH] unexport nr_swap_pages

I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2f96996d 29-Oct-2005 John Hawkes <hawkes@sgi.com>

[PATCH] mm: wider use of for_each_*cpu()

In 'mm' change the explicit use of a for-loop using NR_CPUS into the
general for_each_cpu() constructs. This widens the scope of potential
future optimizations of the general constructs, as well as takes advantage
of the existing optimizations of first_cpu() and next_cpu(), which is
advantageous when the true CPU count is much smaller than NR_CPUS.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3947be19 29-Oct-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] memory hotplug: sysfs and add/remove functions

This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.

Individual architecture patches will follow.

For now, disable memory hotplug when swsusp is enabled. There's a lot of
churn there right now. We'll fix it up properly once it calms down.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bdc8cb98 29-Oct-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] memory hotplug locking: zone span seqlock

See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 208d54e5 29-Oct-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] memory hotplug locking: node_size_lock

pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.

Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem(). However, they are all included for completeness. This
should also make any future consolidation of all of the implementations a
little more straightforward.

This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated. This is the place there pfn_valid() is made false
for a memory area that's being removed. The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c6a57e19 29-Oct-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] memory hotplug prep: fixup bad_range()

When doing memory hotplug operations, the size of existing zones can obviously
change. This means that zone->zone_{start_pfn,spanned_pages} can change.

There are currently no locks that protect these structure members. However,
they are rarely accessed at runtime. Outside of swsusp, the only place that I
can find is bad_range().

So, split bad_range() up into two pieces: one that needs to be locked and
anther that doesn't.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ed8ece2e 29-Oct-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] memory hotplug prep: break out zone initialization

If a zone is empty at boot-time and then hot-added to later, it needs to run
the same init code that would have been run on it at boot.

This patch breaks out zone table and per-cpu-pages functions for use by the
hotplug code. You can almost see all of the free_area_init_core() function on
one page now. :)

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4c21e2f2 29-Oct-2005 Hugh Dickins <hugh@veritas.com>

[PATCH] mm: split page table lock

Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b5810039 29-Oct-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] core remove PageReserved

Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.

PageReserved special casing is removed from get_page and put_page.

All setting and clearing of PageReserved is retained, and it is now flagged
in the page_alloc checks to help ensure we don't introduce any refcount
based freeing of Reserved pages.

MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
deprecated. We never completely handled it correctly anyway, and is be
reintroduced in future if required (Hugh has a proof of concept).

Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
be trivially removed.

Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not. This still
needs to be addressed (a generic page_is_ram() should work).

A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
thus mapcounted and count towards shared rss). These writes to the struct
page could cause excessive cacheline bouncing on big systems. There are a
number of ways this could be addressed if it is an issue.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Refcount bug fix for filemap_xip.c

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e46a5e28 29-Oct-2005 Seth, Rohit <rohit.seth@intel.com>

[PATCH] mm: set per-cpu-pages lower threshold to zero

Set the low water mark for hot pages in pcp to zero.

(akpm: for the life of me I cannot remember why we created pcp->low. Neither
can Martin and the changelog is silent. Maybe it was just a brainfart, but I
have this feeling that there was a reason. If not, we should remove the
fields completely. We'll see.)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ba56e91c 29-Oct-2005 Seth, Rohit <rohit.seth@intel.com>

[PATCH] mm: page_alloc: increase size of per-cpu-pages

Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.

Over 100+ runs for a workload, the difference in mean is about 2%. The best
results for both are almost same. Though the max variation in results with
1/2MB is only 2.2%, whereas with 1/4MB it is 12%.

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 260b2367 21-Oct-2005 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] gfp_t: the rest

zone handling, mapping->flags handling

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# af4ca457 21-Oct-2005 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] gfp_t: infrastructure

Beginning of gfp_t annotations:

- -Wbitwise added to CHECKFLAGS
- old __bitwise renamed to __bitwise__
- __bitwise defined to either __bitwise__ or nothing, depending on
__CHECK_ENDIAN__ being defined
- gfp_t switched from __nocast to __bitwise__
- force cast to gfp_t added to __GFP_... constants
- new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts
the result to int

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1c6fe946 26-Oct-2005 Magnus Damm <magnus@valinux.co.jp>

[PATCH] NUMA: broken per cpu pageset counters

The NUMA counters in struct per_cpu_pageset (linux/mmzone.h) are never
cleared today. This works ok for CPU 0 on NUMA machines because
boot_pageset[] is already zero, but for other CPU:s this results in
uninitialized counters.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dd0fc66f 07-Oct-2005 Al Viro <viro@ftp.linux.org.uk>

[PATCH] gfp flags annotations - part 1

- added typedef unsigned int __nocast gfp_t;

- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9f158333 13-Sep-2005 Randy Dunlap <rdunlap@infradead.org>

[PATCH] use add_taint() for setting tainted bit flags

Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 207f36ee 10-Sep-2005 Renaud Lienhart <renaud.lienhart@free.fr>

[PATCH] remove invalid comment in mm/page_alloc.c

free_pages_bulk() doesn't free the entire list if count == 0.

Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9bf2229f 06-Sep-2005 Paul Jackson <pj@sgi.com>

[PATCH] cpusets: formalize intermediate GFP_KERNEL containment

This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution. With this patch, there are now the following four layers of
memory placement available:

1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current tasks cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.

These nest - each layer is a subset (same or within) of the previous.

Layer (2) above is new, with this patch. The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed. The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.

GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.

The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such. Swapper and oom_kill activity is also constrained to Layer (2). A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset. Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.

This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.

Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6c231b7b 06-Sep-2005 Ravikiran G Thirumalai <kiran@scalex86.org>

[PATCH] Additions to .data.read_mostly section

Mark variables which are usually accessed for reads with __readmostly.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c3d8c141 06-Sep-2005 Christoph Lameter <christoph@scalex86.org>

[PATCH] More __read_mostly variables

Move some more frequently read variables that showed up during some of our
performance tests as sometimes ending up in hot cachelines to the
read_mostly section.

Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
variable like the other uses of __read_mostly.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c07e02db 03-Sep-2005 Martin Hicks <mort@sgi.com>

[PATCH] VM: add page_state info to per-node meminfo

Add page_state info to the per-node meminfo file in sysfs. This is mostly
just for informational purposes.

The lack of this information was brought up recently during a discussion
regarding pagecache clearing, and I put this patch together to test out one
of the suggestions.

It seems like interesting info to have, so I'm submitting the patch.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 53e9a615 03-Sep-2005 Martin Hicks <mort@bork.org>

[PATCH] VM: zone reclaim atomic ops cleanup

Christoph Lameter and Marcelo Tosatti asked to get rid of the
atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 242e5468 03-Sep-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: remove atomic

This bitop does not need to be atomic because it is performed when there will
be no references to the page (ie. the page is being freed).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e310fd43 29-Jul-2005 Martin J. Bligh <mbligh@mbligh.org>

[PATCH] Fix NUMA node sizing in nr_free_zone_pages

We are iterating over all nodes in nr_free_zone_pages(). Because the
fallback zonelists contain all nodes in the system, and we walk all the
zonelists, we're counting memory multiple times (once for each node). This
caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes
all the dirty ratio calculations, etc incorrect.

There's still a further bug to fix from e820 holes causing overestimation
as well, but this fix is separate, and good as is, and fixes one class of
problems. Problem found by Badari, and tested by Ram Pai - thanks!

Signed-off-by: Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by: Matt Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 12b1c5f3 27-Jul-2005 Andy Whitcroft <apw@shadowen.org>

[PATCH] Remove bogus warning in page_alloc.c

Originally __free_pages_bulk used the relative page number within a zone to
define its buddies. This meant that to maintain the "maximally aligned"
requirements (that an allocation of size N will be aligned at least to N
physically) zones had to also be aligned to 1<<MAX_ORDER pages. When
__free_pages_bulk was updated to use the relative page frame numbers of the
free'd pages to pair buddies this released the alignment constraint on the
'left' edge of the zone. This allows _either_ edge of the zone to contain
partial MAX_ORDER sized buddies. These simply never will have matching
buddies and thus will never make it to the 'top' of the pyramid.

The patch below removes a now redundant check ensuring that the mem_map was
aligned to MAX_ORDER.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 37b173a4 07-Jul-2005 Marcelo Tosatti <marcelo.tosatti@cyclades.com>

[PATCH] remove completly bogus comment inside __alloc_pages() try_to_free_pages handling

Remove completly bogus comment from did_some_progress != 0 handling (that
same comment is a few lines below on did_some_progress = 0 case, where it
belongs).

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 79b9ce31 07-Jul-2005 Marcelo Tosatti <marcelo.tosatti@cyclades.com>

[PATCH] print order information when OOM killing

Dump the current allocation order when OOM killing.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3212c6be 27-Jun-2005 Bob Picco <bob.picco@hp.com>

[PATCH] fix WANT_PAGE_VIRTUAL in memmap_init

I spotted this issue while in memmap_init last week. I can't say the
change has any test coverage by me. start_pfn was formerly used in main
"for" loop. The fix is replace start_pfn with pfn.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 29751f69 23-Jun-2005 Andy Whitcroft <apw@shadowen.org>

[PATCH] sparsemem hotplug base

Make sparse's initalization be accessible at runtime. This allows sparse
mappings to be created after boot in a hotplug situation.

This patch is separated from the previous one just to give an indication how
much of the sparse infrastructure is *just* for hotplug memory.

The section_mem_map doesn't really store a pointer. It stores something that
is convenient to do some math against to get a pointer. It isn't valid to
just do *section_mem_map, so I don't think it should be stored as a pointer.

There are a couple of things I'd like to store about a section. First of all,
the fact that it is !NULL does not mean that it is present. There could be
such a combination where section_mem_map *is* NULL, but the math gets you
properly to a real mem_map. So, I don't think that check is safe.

Since we're storing 32-bit-aligned structures, we have a few bits in the
bottom of the pointer to play with. Use one bit to encode whether there's
really a mem_map there, and the other one to tell whether there's a valid
section there. We need to distinguish between the two because sometimes
there's a gap between when a section is discovered to be present and when we
can get the mem_map for it.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 641c7673 23-Jun-2005 Andy Whitcroft <apw@shadowen.org>

[PATCH] sparsemem swiss cheese numa layouts

The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem. It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn. It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.

Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts. The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:

Node 0: +++++ +++++
Node 1: +++++

Previous behavior before Mike's patch was to assign nodes like this:

Node 0: 00000 XXXXX
Node 1: 11111

Where the 'X' areas were simply thrown away. The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0: 000000000000000 Node 1: 11111

This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory. memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages. However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:

struct page *start = pfn_to_page(start_pfn);
// effectively start = &node->node_mem_map[0]
for (page = start; page < (start + size); page++) {
init_page_here();...
page++;
}

Slow, and wasteful, but generally harmless.

But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):

for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
page = pfn_to_page(pfn);
}

And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0. This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d41dee36 23-Jun-2005 Andy Whitcroft <apw@shadowen.org>

[PATCH] sparsemem memory model

Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.

A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.

Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.

Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.

In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.

One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.

This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 93b7504e 23-Jun-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] Introduce new Kconfig option for NUMA or DISCONTIG

There is some confusion that arose when working on SPARSEMEM patch between
what is needed for DISCONTIG vs. NUMA.

Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
All of the current NUMA implementations require an implementation of
DISCONTIG. Because of this, quite a lot of code which is really needed for
NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
case.

Introducing this new NEED_MULTIPLE_NODES config option allows code that is
needed for both NUMA or DISCONTIG to be separated out from code that is
specific to DISCONTIG.

One great advantage of this approach is that it doesn't require every
architecture to be converted over. All of the current implementations
should "just work", only the ones implementing SPARSEMEM will have to be
fixed up.

The change to free_area_init() makes it work inside, or out of the new
config option.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 348f8b6c 23-Jun-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] sparsemem base: reorganize page->flags bit operations

Generify the value fields in the page_flags. The aim is to allow the location
and size of these fields to be varied. Additionally we want to move away from
fixed allocations per field whilst still enforcing the overall bit utilisation
limits. We rely on the compiler to spot and optimise the accessor functions.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6f167ec7 23-Jun-2005 Dave Hansen <haveblue@us.ibm.com>

[PATCH] sparsemem base: simple NUMA remap space allocator

Introduce a simple allocator for the NUMA remap space. This space is very
scarce, used for structures which are best allocated node local.

This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.

Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
to allow us to allocate the pgdat's using this mechanism; which we do not yet
do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b7c84c6a 22-Jun-2005 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] boot_pageset must not be freed.

The boot_pageset needs to be preserved for hotplugging and for off line
processors and nodes. Otherwise pointers will point into memory that has
now a different use. /proc/zoneinfo is currently showing strange results
if processors / nodes are not present.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c0d62219 21-Jun-2005 Denis Vlasenko <vda@ilport.com.ua>

[PATCH] Kill stray newline

OOM killer prints a stray newline.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 334795ec 21-Jun-2005 Hugh Dickins <hugh@veritas.com>

[PATCH] bad_page: clear reclaim and slab

Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2caaad41 21-Jun-2005 Christoph Lameter <christoph@graphe.net>

[PATCH] Reduce size of huge boot per_cpu_pageset

Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset. Use one specially
configured pageset per cpu for all zones and nodes during bootup.

- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
in free_hot_cold_page (otherwise we would have to hold a page
in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
for the boot pageset. A boot pageset is not necessary for cpu hotplug.

Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4ae7c039 21-Jun-2005 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] Periodically drain non local pagesets

The pageset array can potentially acquire a huge amount of memory on large
NUMA systems. F.e. on a system with 512 processors and 256 nodes there
will be 256*512 pagesets. If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets. The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets. Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.

The slab allocator flushes its per cpu caches every 2 seconds. The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 578c2fd6 21-Jun-2005 Janet Morgan <janetmor@us.ibm.com>

[PATCH] add OOM debug

This patch provides more debug info when the system is OOM. It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.

Thanks to Dave Jones for coming up with the idea.

Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c2f29ea1 21-Jun-2005 Benjamin LaHaise <bcrl@kvack.org>

[PATCH] __read_page_state(): pass unsigned long instead of unsigned

By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument. This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 83e5d8f7 21-Jun-2005 Benjamin LaHaise <bcrl@kvack.org>

[PATCH] __mod_page_state(): pass unsigned long instead of unsigned

By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument. This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1ad539b2 21-Jun-2005 Darren Hart <dvhltc@us.ibm.com>

[PATCH] vm: try_to_free_pages unused argument

try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0. The following patch removes the argument and updates all the
calls to try_to_free_pages.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e7c8d5c9 21-Jun-2005 Christoph Lameter <christoph@lameter.com>

[PATCH] node local per-cpu-pages

This patch modifies the way pagesets in struct zone are managed.

Each zone has a per-cpu array of pagesets. So any particular CPU has some
memory in each zone structure which belongs to itself. Even if that CPU is
not local to that zone.

So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone. This means that the operations to manage pages on remote zone can be
done with information available locally.

We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.

AIM7 benchmark on a 32 CPU SGI Altix

w/o patches:
Tasks jobs/min jti jobs/min/task real cpu
1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005
100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005
200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005
300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005
400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005
500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005
600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005
700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005
800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005
900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005
1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005

with slab API changes and pageset patch:

Tasks jobs/min jti jobs/min/task real cpu
1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005
100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005
200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005
300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005
400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005
500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005
600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005
700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005
800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005
900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005
1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1e7e5a90 21-Jun-2005 Martin Hicks <mort@sgi.com>

[PATCH] VM: rate limit early reclaim

When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory. This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0c35bbad 21-Jun-2005 Martin Hicks <mort@sgi.com>

[PATCH] VM: add __GFP_NORECLAIM

When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.

This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts. The example that is implemented here is for page
cache pages. We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 753ee728 21-Jun-2005 Martin Hicks <mort@sgi.com>

[PATCH] VM: early zone reclaim

This is the core of the (much simplified) early reclaim. The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines. With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim. It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4). Without the machine would grind to a crawl when doing a "make -j"
kernel build. Even with this patch the System Time is higher on
average, but it seems tolerable. Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

wall user sys %cpu ctx sw. sleeps
---- ---- --- ---- ------ ------
No patch 1009 1384 847 258 298170 504402
w/patch, no reclaim 880 1376 667 288 254064 396745
w/patch & reclaim 1079 1385 926 252 291625 548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot. Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 295ab934 21-Jun-2005 Nikita Danilov <nikita@clusterfs.com>

[PATCH] mm: add /proc/zoneinfo

Add /proc/zoneinfo file to display information about memory zones. Useful
to analyze VM behaviour.

Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7223a93a 23-Mar-2005 Dean Nelson <dcn@sgi.com>

[IA64] Export node_online_map and node_possible_map

Export node_online_map and node_possible_map so that kernel modules can use
the nodemask macros, like, for_each_node() and for_each_online_node().

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>


# 4dc3b16b 01-May-2005 Pavel Pisa <pisa@cmp.felk.cvut.cz>

[PATCH] DocBook: changes and extensions to the kernel documentation

I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again. The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels. I
have tried to proceed with that task. I have done that more times from 2.6.0
time and it gets boring to do same changes again and again. Linux kernel
compiles after changes for i386 and ARM targets. I have added references to
some more files into kernel-api book, I have added some section names as well.
So please, check that changes do not break something and that categories are
not too much skewed.

I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention. Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors. Changed <pid> to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.

You can see result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

Some more sources are ready to be included into kernel-doc generated
documentation. Sources has been added into kernel-api for now. Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# edfbe2b0 01-May-2005 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] count bounce buffer pages in vmstat

This is a patch for counting the number of pages for bounce buffers. It's
shown in /proc/vmstat.

Currently, the number of bounce pages are not counted anywhere. So, if
there are many bounce pages, it seems that there are leaked pages. And
it's difficult for a user to imagine the usage of bounce pages. So, it's
meaningful to show # of bouce pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b84a35be 01-May-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mempool: NOMEMALLOC and NORETRY

Mempools have 2 problems.

The first is that mempool_alloc can possibly get stuck in __alloc_pages
when they should opt to fail, and take an element from their reserved pool.

The second is that it will happily eat emergency PF_MEMALLOC reserves
instead of going to their reserved pools.

Fix the first by passing __GFP_NORETRY in the allocation calls in
mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag which
directs the page allocator not to allocate from the reserve pool.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8e30f272 01-May-2005 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] mm: pcp use non powers of 2 for batch size

Jack Steiner reported this to have fixed his problem (bad colouring):
"The patches fix both problems that I found - bad
coloring & excessive pages in pagesets."

In most workloads this is not likely to be such a pronounced problem,
however it should help corner cases. And avoiding powers of 2 in these
types of memory operations is always a good idea.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!