Searched +hist:7 +hist:c4da061 (Results 1 - 1 of 1) sorted by relevance

/linux-master/mm/
H A Dslab_common.cdiff 7d6b6cc3 Thu Feb 17 02:16:09 MST 2022 Miaohe Lin <linmiaohe@huawei.com> mm/slab_common: use helper function is_power_of_2()

Use helper function is_power_of_2() to check if KMALLOC_MIN_SIZE is power
of two. Minor readability improvement.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220217091609.8214-1-linmiaohe@huawei.com
diff 92850134 Thu Feb 25 18:19:55 MST 2021 Andrey Konovalov <andreyknvl@google.com> kasan, mm: don't save alloc stacks twice

Patch series "kasan: optimizations and fixes for HW_TAGS", v4.

This patchset makes the HW_TAGS mode more efficient, mostly by reworking
poisoning approaches and simplifying/inlining some internal helpers.

With this change, the overhead of HW_TAGS annotations excluding setting
and checking memory tags is ~3%. The performance impact caused by tags
will be unknown until we have hardware that supports MTE.

As a side-effect, this patchset speeds up generic KASAN by ~15%.

This patch (of 13):

Currently KASAN saves allocation stacks in both kasan_slab_alloc() and
kasan_kmalloc() annotations. This patch changes KASAN to save allocation
stacks for slab objects from kmalloc caches in kasan_kmalloc() only, and
stacks for other slab objects in kasan_slab_alloc() only.

This change requires ____kasan_kmalloc() knowing whether the object
belongs to a kmalloc cache. This is implemented by adding a flag field to
the kasan_info structure. That flag is only set for kmalloc caches via a
new kasan_cache_create_kmalloc() annotation.

Link: https://lkml.kernel.org/r/cover.1612546384.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/7c673ebca8d00f40a7ad6f04ab9a2bddeeae2097.1612546384.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 7e1fa93d Wed Feb 24 01:01:12 MST 2021 Vlastimil Babka <vbabka@suse.cz> mm, slab, slub: stop taking memory hotplug lock

Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
a heavy lock and it's best to avoid it if possible, as we had several
issues with lockdep complaining about ordering in the past, see e.g.
e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").

The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
can be summarized as follows: while there's slab_mutex synchronizing new
kmem cache creation and SLUB's MEM_GOING_ONLINE callback
slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
for the hotplugged node in the new kmem cache, because the hotplug
callback doesn't yet see the new cache, and cache creation in
init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
as that happens only later after the MEM_GOING_ONLINE callback.

Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the
per-node kmem_cache_node structures. This nodemask would generally mirror
the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
control in its memory hotplug callbacks under the slab_mutex. This patch
adds such nodemask and its handling.

Commit 03afc0e25f7f mentiones "issues like [the one above]", but there
don't appear to be further issues. All the paths (shared for SLAB and
SLUB) taking the memory hotplug locks are also taking the slab_mutex,
except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().

We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
SLUB can enters the function from a write to sysfs 'shrink' file, thus
holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
within slab_mutex. But on closer inspection we don't actually need to
protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
__kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
added in parallel hotplug is not fatal, and parallel hotremove does not
free kmem_cache_node's anymore after the previous patch, so use-after free
cannot happen. The per-node shrinking itself is protected by
n->list_lock. Same is true for SLAB, and SLOB is no-op.

SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory
hotplug callbacks are also protected by slab_mutex against races with
these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
node is already set there during the MEM_GOING_ONLINE callback, so no
special care is needed for SLAB.

As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.

Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 49f2d241 Mon Jun 01 22:45:43 MDT 2020 Vlastimil Babka <vbabka@suse.cz> usercopy: mark dma-kmalloc caches as usercopy caches

We have seen a "usercopy: Kernel memory overwrite attempt detected to
SLUB object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as
IUCV uses kmalloc() with __GFP_DMA because of memory address
restrictions. The issue has been discussed [2] and it has been noted
that if all the kmalloc caches are marked as usercopy, there's little
reason not to mark dma-kmalloc caches too. The 'dma' part merely means
that __GFP_DMA is used to restrict memory address range.

As Jann Horn put it [3]:
"I think dma-kmalloc slabs should be handled the same way as normal
kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is
just normal kernel memory - even if it might later be used for DMA -,
and it should be perfectly fine to copy_from_user() into such
allocations at that point, and to copy_to_user() out of them at the
end. If you look at the places where such allocations are created, you
can see things like kmemdup(), memcpy() and so on - all normal
operations that shouldn't conceptually be different from usercopy in
any relevant way."

Thus this patch marks the dma-kmalloc-* caches as usercopy.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1156053
[2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c68@suse.cz/
[3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYAVk8=XYaKj4_+6e6Hg@mail.gmail.com/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Jiri Slaby <jslaby@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Windsor <dave@nullcore.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Luis de Bethencourt <luisbg@kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Garrett <mjg59@google.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2fe20210 Mon Jan 13 17:29:32 MST 2020 Adrian Huang <ahuang12@lenovo.com> mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid

When booting with amd_iommu=off, the following WARNING message
appears:

AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
Call Trace:
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled

The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:

free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.

Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue(). This prevents
the WARNING message triggered by flush_workqueue().

Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2fe20210 Mon Jan 13 17:29:32 MST 2020 Adrian Huang <ahuang12@lenovo.com> mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid

When booting with amd_iommu=off, the following WARNING message
appears:

AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
Call Trace:
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled

The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:

free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.

Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue(). This prevents
the WARNING message triggered by flush_workqueue().

Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2fe20210 Mon Jan 13 17:29:32 MST 2020 Adrian Huang <ahuang12@lenovo.com> mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid

When booting with amd_iommu=off, the following WARNING message
appears:

AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
Call Trace:
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled

The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:

free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.

Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue(). This prevents
the WARNING message triggered by flush_workqueue().

Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 57033297 Thu Jul 11 21:56:20 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: don't check the dying flag on kmem_cache creation

There is no point in checking the root_cache->memcg_params.dying flag on
kmem_cache creation path. New allocations shouldn't be performed using a
dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
after the flag is set. And if it was scheduled before,
flush_memcg_workqueue() will wait for it anyway.

So let's drop this check to simplify the code.

Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c03914b7 Thu Jul 11 21:56:02 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()

Patch series "mm: reparent slab memory on cgroup removal", v7.

# Why do we need this?

We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in the userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the main reason: slab objects.

The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be reclaimed unless all charged
pages are gone. If a slab object is actively used by other cgroups, it
won't be reclaimed, and will prevent the origin cgroup from being
reclaimed.

Slab objects, and first of all vfs cache, is shared between cgroups, which
are using the same underlying fs, and what's even more important, it's
shared between multiple generations of the same workload. So if something
is running periodically every time in a new cgroup (like how systemd
works), we do accumulate multiple dying cgroups.

Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on LRUs of
dying cgroups, and these LRUs contain all charged pages. My experiments
show that with the disabled kernel memory accounting the number of dying
cgroups stabilizes at a relatively small number (~100, depends on memory
pressure and cgroup creation rate), and with kernel memory accounting it
grows pretty steadily up to several thousands.

Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses. Memory occupied by dying
cgroups is measured in hundreds of megabytes. I've even seen a host with
more than 100Gb of memory wasted for dying cgroups. It leads to a
degradation of performance with the uptime, and generally limits the usage
of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused a regressions with xfs and ext4, and has been
reverted [4]. The following attempts to find the right balance [5, 6]
were not successful.

So instead of trying to find a maybe non-existing balance, let's do
reparent accounted slab caches to the parent cgroup on cgroup removal.

# Implementation approach

There is however a significant problem with reparenting of slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing of a new list is really not an option.

But fortunately there is a way forward: every slab page has a stable
pointer to the corresponding kmem_cache. So the idea is to reparent
kmem_caches instead of slab pages.

It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches to hold a single reference to the memory cgroup,
instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
page->kmem_cache->memcg indirection instead. It's used only on
slab page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and release
the associated memory cgroup.

There is a bonus: currently we release all memcg kmem_caches all together
with the memory cgroup itself. This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.

Some additional implementation details are provided in corresponding
commit messages.

# Results

Below is the average number of dying cgroups on two groups of our
production hosts. They do run some sort of web frontend workload, the
memory pressure is moderate. As we can see, with the kernel memory
reparenting the number stabilizes in 60s range; however with the original
version it grows almost linearly and doesn't show any signs of plateauing.
The difference in slab and percpu usage between patched and unpatched
versions also grows linearly. In 7 days it exceeded 200Mb.

day 0 1 2 3 4 5 6 7
original 56 362 628 752 1070 1250 1490 1560
patched 23 46 51 55 60 57 67 69
mem diff(Mb) 22 74 123 152 164 182 214 241

# Links

[1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

This patch (of 10):

Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
rather than in init_memcg_params().

Once kmem_cache will hold a reference to the memory cgroup, it will
simplify the refcounting.

For non-root kmem_caches memcg_link_cache() is always called before the
kmem_cache becomes visible to a user, so it's safe.

Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c03914b7 Thu Jul 11 21:56:02 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()

Patch series "mm: reparent slab memory on cgroup removal", v7.

# Why do we need this?

We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in the userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the main reason: slab objects.

The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be reclaimed unless all charged
pages are gone. If a slab object is actively used by other cgroups, it
won't be reclaimed, and will prevent the origin cgroup from being
reclaimed.

Slab objects, and first of all vfs cache, is shared between cgroups, which
are using the same underlying fs, and what's even more important, it's
shared between multiple generations of the same workload. So if something
is running periodically every time in a new cgroup (like how systemd
works), we do accumulate multiple dying cgroups.

Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on LRUs of
dying cgroups, and these LRUs contain all charged pages. My experiments
show that with the disabled kernel memory accounting the number of dying
cgroups stabilizes at a relatively small number (~100, depends on memory
pressure and cgroup creation rate), and with kernel memory accounting it
grows pretty steadily up to several thousands.

Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses. Memory occupied by dying
cgroups is measured in hundreds of megabytes. I've even seen a host with
more than 100Gb of memory wasted for dying cgroups. It leads to a
degradation of performance with the uptime, and generally limits the usage
of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused a regressions with xfs and ext4, and has been
reverted [4]. The following attempts to find the right balance [5, 6]
were not successful.

So instead of trying to find a maybe non-existing balance, let's do
reparent accounted slab caches to the parent cgroup on cgroup removal.

# Implementation approach

There is however a significant problem with reparenting of slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing of a new list is really not an option.

But fortunately there is a way forward: every slab page has a stable
pointer to the corresponding kmem_cache. So the idea is to reparent
kmem_caches instead of slab pages.

It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches to hold a single reference to the memory cgroup,
instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
page->kmem_cache->memcg indirection instead. It's used only on
slab page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and release
the associated memory cgroup.

There is a bonus: currently we release all memcg kmem_caches all together
with the memory cgroup itself. This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.

Some additional implementation details are provided in corresponding
commit messages.

# Results

Below is the average number of dying cgroups on two groups of our
production hosts. They do run some sort of web frontend workload, the
memory pressure is moderate. As we can see, with the kernel memory
reparenting the number stabilizes in 60s range; however with the original
version it grows almost linearly and doesn't show any signs of plateauing.
The difference in slab and percpu usage between patched and unpatched
versions also grows linearly. In 7 days it exceeded 200Mb.

day 0 1 2 3 4 5 6 7
original 56 362 628 752 1070 1250 1490 1560
patched 23 46 51 55 60 57 67 69
mem diff(Mb) 22 74 123 152 164 182 214 241

# Links

[1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

This patch (of 10):

Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
rather than in init_memcg_params().

Once kmem_cache will hold a reference to the memory cgroup, it will
simplify the refcounting.

For non-root kmem_caches memcg_link_cache() is always called before the
kmem_cache becomes visible to a user, so it's safe.

Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Completed in 213 milliseconds