Searched +hist:83 +hist:aeeada (Results 1 - 4 of 4) sorted by relevance

/linux-master/include/linux/
H A Dshrinker.hdiff b7217a0b Mon Nov 14 16:59:49 MST 2022 T.J. Mercier <tjmercier@google.com> mm: shrinkers: add missing includes for undeclared types

The shrinker.h header depends on a user including other headers before it
for types used by shrinker.h. Fix this by including the appropriate
headers in shrinker.h.

./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t'
13 | gfp_t gfp_mask;
| ^~~~~
./include/linux/shrinker.h:71:26: error: field `list' has incomplete type
71 | struct list_head list;
| ^~~~
./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t'
82 | atomic_long_t *nr_deferred;
|

Link: https://lkml.kernel.org/r/20221114235949.201749-1-tjmercier@google.com
Fixes: 83aeeada7c69 ("vmscan: use atomic-long for shrinker batching")
Fixes: b0d40c92adaf ("superblock: introduce per-sb cache shrinker infrastructure")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H A Dfs.hdiff 83f936c7 Thu Mar 13 22:02:47 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> mark struct file that had write access grabbed by open()

new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 76791ab2 Wed Mar 25 09:48:35 MDT 2009 Ingo Molnar <mingo@elte.hu> kmemtrace, fs: uninline simple_transaction_set()

Impact: cleanup

We want to remove percpu.h from rcupdate.h (for upcoming kmemtrace
changes), but this is not possible currently without breaking the
build because fs.h has an implicit include file depedency: it
uses PAGE_SIZE but does not include asm/page.h which defines it.

This problem gets masked in practice because most fs.h using sites
use rcupreempt.h (and other headers) which includes percpu.h which
brings in asm/page.h indirectly.

We cannot add asm/page.h to asm/fs.h because page.h is not an
exported header.

Move simple_transaction_set() to the other simple-transaction
file helpers in fs/libfs.c.

This removes the include file hell and also reduces
kernel size a bit.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff 83b7b44e Wed Dec 06 21:38:53 MST 2006 Eric Dumazet <dada1@cosmosbay.com> [PATCH] fs: reorder some 'struct inode' fields to speedup i_size manipulations

On 32bits SMP platforms, 64bits i_size is protected by a seqcount
(i_size_seqcount).

When i_size is read or written, i_size_seqcount is read/written as well, so
it make sense to group these two fields together in the same cache line.

This patch moves i_size_seqcount next to i_size, and also moves i_version
to let offsetof(struct inode, i_size) being 0x40 instead of 0x3c (for
32bits platforms).

For 64 bits platforms, i_size_seqcount doesnt exist, and the move of a
'long i_version' should not introduce a new hole because of padding.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
H A Dmm.hdiff 2f52578f Thu Dec 10 08:55:05 MST 2020 Matthew Wilcox (Oracle) <willy@infradead.org> mm/util: Add folio_mapping() and folio_file_mapping()

These are the folio equivalent of page_mapping() and page_file_mapping().
Add an out-of-line page_mapping() wrapper around folio_mapping()
in order to prevent the page_folio() call from bloating every caller
of page_mapping(). Adjust page_file_mapping() and page_mapping_file()
to use folios internally. Rename __page_file_mapping() to
swapcache_mapping() and change it to take a folio.

This ends up saving 122 bytes of text overall. folio_mapping() is
45 bytes shorter than page_mapping() was, but the new page_mapping()
wrapper is 30 bytes. The major reduction is a few bytes less in dozens
of nfs functions (which call page_file_mapping()). Most of these appear
to be a slight change in gcc's register allocation decisions, which allow:

48 8b 56 08 mov 0x8(%rsi),%rdx
48 8d 42 ff lea -0x1(%rdx),%rax
83 e2 01 and $0x1,%edx
48 0f 44 c6 cmove %rsi,%rax

to become:

48 8b 46 08 mov 0x8(%rsi),%rax
48 8d 78 ff lea -0x1(%rax),%rdi
a8 01 test $0x1,%al
48 0f 44 fe cmove %rsi,%rdi

for a reduction of a single byte. Once the NFS client is converted to
use folios, this entire sequence will disappear.

Also add folio_mapping() documentation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
diff 83b57531 Sun Jul 09 17:14:01 MDT 2017 Eric W. Biederman <ebiederm@xmission.com> mm/memory_failure: Remove unused trapno from memory_failure

Today 4 architectures set ARCH_SUPPORTS_MEMORY_FAILURE (arm64, parisc,
powerpc, and x86), while 4 other architectures set __ARCH_SI_TRAPNO
(alpha, metag, sparc, and tile). These two sets of architectures do
not interesect so remove the trapno paramater to remove confusion.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 83aeeada Thu Dec 08 15:33:54 MST 2011 Konstantin Khlebnikov <khlebnikov@openvz.org> vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 83b964bb Sun Mar 13 13:49:16 MDT 2011 Stephen Wilson <wilsons@start.ca> mm: arch: make in_gate_area take an mm_struct instead of a task_struct

Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task. Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 83f78668 Wed Sep 16 03:50:13 MDT 2009 Wu Fengguang <fengguang.wu@intel.com> HWPOISON: Add invalidate_inode_page

Add a simple way to invalidate a single page
This is just a refactoring of the truncate.c code.
Originally from Fengguang, modified by Andi Kleen.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
diff 83c54070 Thu Jul 19 02:47:05 MDT 2007 Nick Piggin <npiggin@suse.de> mm: fault feedback #2

This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer. This requires requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).

[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/mm/
H A Dvmscan.cdiff 8cd7c588 Fri Nov 05 14:42:25 MDT 2021 Mel Gorman <mgorman@techsingularity.net> mm/vmscan: throttle reclaim until some writeback completes if congested

Patch series "Remove dependency on congestion_wait in mm/", v5.

This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).

This series replaces the "congestion" throttling with 3 different types.

- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned

- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU

- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.

- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time

- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.

- Y file readers which is fio randomly reading small files

- Z anon memory hogs which continually map (100-dirty_ratio)% of memory

- Total estimated WSS = (100+dirty_ration) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.

First the results of the "anon latency" latency

stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.

It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory

Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)

The overall system CPU usage and elapsed time is as follows

5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstats show

5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97

Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.

The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.

Ops Percentage direct scans 90.59 77.37

For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling

Ops Page writes by reclaim 2613590.00 1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon 2932752.00 1917048.00

And there is less swapping.

Ops Page reclaim immediate 996248528.00 107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned 164284.00 153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were

6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all. A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map

1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.

- Direct reclaim will stall if all dirty pages are backed by congested
inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d51d1e64 Tue Apr 10 17:28:07 MDT 2018 Steven Rostedt <rostedt@goodmis.org> mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:

0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d51d1e64 Tue Apr 10 17:28:07 MDT 2018 Steven Rostedt <rostedt@goodmis.org> mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:

0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d51d1e64 Tue Apr 10 17:28:07 MDT 2018 Steven Rostedt <rostedt@goodmis.org> mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:

0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d51d1e64 Tue Apr 10 17:28:07 MDT 2018 Steven Rostedt <rostedt@goodmis.org> mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:

0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d51d1e64 Tue Apr 10 17:28:07 MDT 2018 Steven Rostedt <rostedt@goodmis.org> mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:

0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 83da7510 Fri Apr 18 16:07:10 MDT 2014 Christoph Lameter <cl@linux.com> vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()

Seems to be called with preemption enabled. Therefore it must use
mod_zone_page_state instead.

Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 117aad1e Mon Sep 30 14:45:16 MDT 2013 Rafael Aquini <aquini@redhat.com> mm: avoid reinserting isolated balloon pages into LRU lists

Isolated balloon pages can wrongly end up in LRU lists when
migrate_pages() finishes its round without draining all the isolated
page list.

The same issue can happen when reclaim_clean_pages_from_list() tries to
reclaim pages from an isolated page list, before migration, in the CMA
path. Such balloon page leak opens a race window against LRU lists
shrinkers that leads us to the following kernel panic:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
PGD 3cda2067 PUD 3d713067 PMD 0
Oops: 0000 [#1] SMP
CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: shrink_page_list+0x24e/0x897
RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
Call Trace:
shrink_inactive_list+0x240/0x3de
shrink_lruvec+0x3e0/0x566
__shrink_zone+0x94/0x178
shrink_zone+0x3a/0x82
balance_pgdat+0x32a/0x4c2
kswapd+0x2f0/0x372
kthread+0xa2/0xaa
ret_from_fork+0x7c/0xb0
Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
RIP [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
RSP <ffff88003da499b8>
CR2: 0000000000000028
---[ end trace 703d2451af6ffbfd ]---
Kernel panic - not syncing: Fatal exception

This patch fixes the issue, by assuring the proper tests are made at
putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
isolated balloon pages being wrongly reinserted in LRU lists.

[akpm@linux-foundation.org: clarify awkward comment text]
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff e2be15f6 Wed Jul 03 16:01:57 MDT 2013 Mel Gorman <mgorman@suse.de> mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered

Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems. First and foremost, it's possible for pages
under writeback to be freed which will lead to badness. Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed. In some cases this results in
increased read IO to re-read data from disk. Third, more pages were
being written from kswapd context which can adversly affect IO
performance. Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3). This disconnect confuses the reclaim stalling logic. This
follow-up series is aimed at these problems.

The tests were based on three kernels

vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to
kswapd" applied on top as per what should be in Andrew's tree
right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.

parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%)
Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%)
Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%)
Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%)
Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%)
Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%)
Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%)
Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%)
Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%)
Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%)
Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%)
Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%)
Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%)

memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
23K/sec to just over 4K/second when there is 2385M of IO going
on in the background. With current mmotm, there is no collapse
in performance and with this follow-up series there is little
change.

swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.

3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults 11160152 10706748 10622316
Major Faults 46305 755 678
Swap Ins 260249 0 0
Swap Outs 683860 18 18
Direct pages scanned 0 678 2520
Kswapd pages scanned 6046108 8814900 1639279
Kswapd pages reclaimed 1081954 1172267 1094635
Direct pages reclaimed 0 566 2304
Kswapd efficiency 17% 13% 66%
Kswapd velocity 5217.560 7618.953 1414.879
Direct efficiency 100% 83% 91%
Direct velocity 0.000 0.586 2.175
Percentage direct scans 0% 0% 0%
Zone normal velocity 5105.086 6824.681 671.158
Zone dma32 velocity 112.473 794.858 745.896
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 1929612.000 6861768.000 32821.000
Page writes file 1245752 6861750 32803
Page writes anon 683860 18 18
Page reclaim immediate 7484 40 239
Sector Reads 1130320 93996 86900
Sector Writes 13508052 10823500 11804436
Page rescued immediate 0 0 0
Slabs scanned 33536 27136 18560
Direct inode steals 0 0 0
Kswapd inode steals 8641 1035 0
Kswapd skipped wait 0 0 0
THP fault alloc 8 37 33
THP collapse alloc 508 552 515
THP splits 24 1 1
THP fault fallback 0 0 0
THP collapse fail 0 0 0

There are a number of observations to make here

1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
pages swapped were really unused anonymous pages. Related to that,
major faults are much reduced.

2. kswapd efficiency was impacted by the initial series but with these
follow-up patches, the efficiency is now at 66% indicating that far
fewer pages were skipped during scanning due to dirty or writeback
pages.

3. kswapd velocity is reduced indicating that fewer pages are being scanned
with the follow-up series as kswapd now stalls when the tail of the
LRU queue is full of unqueued dirty pages. The stall gives flushers a
chance to catch-up so kswapd can reclaim clean pages when it wakes

4. In light of Zlatko's recent reports about zone scanning imbalances,
mmtests now reports scanning velocity on a per-zone basis. With mainline,
you can see that the scanning activity is dominated by the Normal
zone with over 45 times more scanning in Normal than the DMA32 zone.
With the series currently in mmotm, the ratio is slightly better but it
is still the case that the bulk of scanning is in the highest zone. With
this follow-up series, the ratio of scanning between the Normal and
DMA32 zone is roughly equal.

5. As Dave Chinner observed, the current patches in mmotm increased the
number of pages written from kswapd context which is expected to adversly
impact IO performance. With the follow-up patches, far fewer pages are
written from kswapd context than the mainline kernel

6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
the follow-up series, there is less slab shrinking activity and no inodes
were reclaimed.

7. Note that "Sectors Read" is drastically reduced implying that the source
data being used for the IO is not being aggressively discarded due to
page reclaim skipping over dirty pages and reclaiming clean pages. Note
that the reducion in reads could also be due to inode data not being
re-read from disk after a slab shrink.

3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 166.99 32.09 33.44
Mean sda-await 853.64 192.76 185.43
Mean sda-r_await 6.31 9.24 5.97
Mean sda-w_await 2992.81 202.65 192.43
Max sda-avgqz 1409.91 718.75 698.98
Max sda-await 6665.74 3538.00 3124.23
Max sda-r_await 58.96 111.95 58.00
Max sda-w_await 28458.94 3977.29 3148.61

In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations

1. The average queue size is reduced by the initial series and roughly
the same with this follow up.

2. Average wait times for writes are reduced and as the IO
is completing faster it at least implies that the gain is because
flushers are writing the files efficiently instead of page reclaim
getting in the way.

3. The reduction in maximum write latency is staggering. 28 seconds down
to 3 seconds.

Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.

Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.

I ran a longer-lived memcached test with IO going to NFS instead of a local disk

parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%)
Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%)
Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%)
Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%)
Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%)
Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%)
Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%)
Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%)
Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%)
Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%)
Ops minorfaults-0M 1445969.00 ( 0.00%) 1520878.00 ( -5.18%) 1454024.00 ( -0.56%)
Ops minorfaults-715M 1557288.00 ( 0.00%) 1528482.00 ( 1.85%) 1535776.00 ( 1.38%)
Ops minorfaults-2385M 1692896.00 ( 0.00%) 1570523.00 ( 7.23%) 1559622.00 ( 7.87%)
Ops minorfaults-4055M 1654985.00 ( 0.00%) 1581456.00 ( 4.44%) 1596713.00 ( 3.52%)
Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%)
Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%)
Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%)

1. Performance does not collapse due to IO which is good. IO is also completing
faster. Note with mmotm, IO completes in a third of the time and faster again
with this series applied

2. Swapping is reduced, although not eliminated. The figures for the follow-up
look bad but it does vary a bit as the stalling is not perfect for nfs
or filesystems like ext3 with unusual handling of dirty and writeback
pages

3. There are swapins, particularly with larger amounts of IO indicating
that active pages are being reclaimed. However, the number of much
reduced.

3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults 36339175 35025445 35219699
Major Faults 310964 27108 51887
Swap Ins 2176399 173069 333316
Swap Outs 3344050 357228 504824
Direct pages scanned 8972 77283 43242
Kswapd pages scanned 20899983 8939566 14772851
Kswapd pages reclaimed 6193156 5172605 5231026
Direct pages reclaimed 8450 73802 39514
Kswapd efficiency 29% 57% 35%
Kswapd velocity 3929.743 1847.499 3058.840
Direct efficiency 94% 95% 91%
Direct velocity 1.687 15.972 8.954
Percentage direct scans 0% 0% 0%
Zone normal velocity 3721.907 939.103 2185.142
Zone dma32 velocity 209.522 924.368 882.651
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 4082185.000 526319.000 537114.000
Page writes file 738135 169091 32290
Page writes anon 3344050 357228 504824
Page reclaim immediate 9524 170 5595843
Sector Reads 8909900 861192 1483680
Sector Writes 13428980 1488744 2076800
Page rescued immediate 0 0 0
Slabs scanned 38016 31744 28672
Direct inode steals 0 0 0
Kswapd inode steals 424 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 14 15 119
THP collapse alloc 1767 1569 1618
THP splits 30 29 25
THP fault fallback 0 0 0
THP collapse fail 8 5 0
Compaction stalls 17 41 100
Compaction success 7 31 95
Compaction failures 10 10 5
Page migrate success 7083 22157 62217
Page migrate failure 0 0 0
Compaction pages isolated 14847 48758 135830
Compaction migrate scanned 18328 48398 138929
Compaction free scanned 2000255 355827 1720269
Compaction cost 7 24 68

I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.

3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 23.58 0.35 0.44
Mean sda-await 133.47 15.72 15.46
Mean sda-r_await 4.72 4.69 3.95
Mean sda-w_await 507.69 28.40 33.68
Max sda-avgqz 680.60 12.25 23.14
Max sda-await 3958.89 221.83 286.22
Max sda-r_await 63.86 61.23 67.29
Max sda-w_await 11710.38 883.57 1767.28

And as before, write wait times are much reduced.

This patch:

The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered. This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO. The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed. Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped. Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.

This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages. Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up. The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only. Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 96710098 Fri Nov 16 15:14:59 MST 2012 Mel Gorman <mgorman@suse.de> mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

Jiri Slaby reported the following:

(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".) Given kswapd
had hours of runtime in ps/top output yesterday in the morning
and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.

The intention of the patch in question was to compensate for the loss of
lumpy reclaim. Part of the reason lumpy reclaim worked is because it
aggressively reclaimed pages and this patch was meant to be a sane
compromise.

When compaction fails, it gets deferred and both compaction and
reclaim/compaction is deferred avoid excessive reclaim. However, since
commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
each time and continues reclaiming which was not taken into account when
the patch was developed.

Attempts to address the problem ended up just changing the shape of the
problem instead of fixing it. The release window gets closer and while
a THP allocation failing is not a major problem, kswapd chewing up a lot
of CPU is.

This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of
pages reclaimed by reclaim/compaction based on failures") and will be
revisited in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Completed in 2808 milliseconds