Searched +hist:6 +hist:d6ea1e9 (Results 1 - 4 of 4) sorted by relevance
/linux-master/mm/ | ||
H A D | slab.h | diff bb944290 Tue Jan 03 23:06:04 MST 2023 Feng Tang <feng.tang@intel.com> mm/slab: add is_kmalloc_cache() helper function commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") introduces 'SLAB_KMALLOC' bit specifying whether a kmem_cache is a kmalloc cache for slab/slub (slob doesn't have dedicated kmalloc caches). Add a helper inline function for other components like kasan to simplify code. Link: https://lkml.kernel.org/r/20230104060605.930910-1-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 6cd6d33c Wed Nov 30 01:54:51 MST 2022 Feng Tang <feng.tang@intel.com> mm/slub, kunit: Add a test case for kmalloc redzone check kmalloc redzone check for slub has been merged, and it's better to add a kunit case for it, which is inspired by a real-world case as described in commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption"): " octeon-hcd will crash the kernel when SLOB is used. This usually happens after the 18-byte control transfer when a device descriptor is read. The DMA engine is always transferring full 32-bit words and if the transfer is shorter, some random garbage appears after the buffer. The problem is not visible with SLUB since it rounds up the allocations to word boundary, and the extra bytes will go undetected. " To avoid interrupting the normal functioning of kmalloc caches, a kmem_cache mimicing kmalloc cache is created with similar flags, and kmalloc_trace() is used to really test the orig_size and redzone setup. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff fdbcb2a6 Mon Jun 28 20:37:19 MDT 2021 Waiman Long <longman@redhat.com> mm/memcg: move mod_objcg_state() to memcontrol.c Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6. With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). For workloads that require a lot of kmemcache allocations and de-allocations, they may experience performance regression as illustrated in [1] and [2]. A simple kernel module that performs repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte object or a big 4k object at module init time with a batch size of 4 (4 kmalloc's followed by 4 kfree's) is used for benchmarking. The benchmarking tool was run on a kernel based on linux-next-20210419. The test was run on a CascadeLake server with turbo-boosting disable to reduce run-to-run variation. The small object test exercises mainly the object stock charging and vmstat update code paths. The large object test also exercises the refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code paths. With memory accounting disabled, the run time was 3.130s with both small object big object tests. With memory accounting enabled, both cgroup v1 and v2 showed similar results in the small object test. The performance results of the large object test, however, differed between cgroup v1 and v2. The execution times with the application of various patches in the patchset were: Applied patches Run time Accounting overhead %age 1 %age 2 --------------- -------- ------------------- ------ ------ Small 32-byte object: None 11.634s 8.504s 100.0% 271.7% 1-2 9.425s 6.295s 74.0% 201.1% 1-3 9.708s 6.578s 77.4% 210.2% 1-4 8.062s 4.932s 58.0% 157.6% Large 4k object (v2): None 22.107s 18.977s 100.0% 606.3% 1-2 20.960s 17.830s 94.0% 569.6% 1-3 14.238s 11.108s 58.5% 354.9% 1-4 11.329s 8.199s 43.2% 261.9% Large 4k object (v1): None 36.807s 33.677s 100.0% 1075.9% 1-2 36.648s 33.518s 99.5% 1070.9% 1-3 22.345s 19.215s 57.1% 613.9% 1-4 18.662s 15.532s 46.1% 496.2% N.B. %age 1 = overhead/unpatched overhead %age 2 = overhead/accounting disabled time Patch 2 (vmstat data stock caching) helps in both the small object test and the large v2 object test. It doesn't help much in v1 big object test. Patch 3 (refill_obj_stock improvement) does help the small object test but offer significant performance improvement for the large object test (both v1 and v2). Patch 4 (eliminating irq disable/enable) helps in all test cases. To test for the extreme case, a multi-threaded kmalloc/kfree microbenchmark was run on the 2-socket 48-core 96-thread system with 96 testing threads in the same memcg doing kmalloc+kfree of a 4k object with accounting enabled for 10s. The total number of kmalloc+kfree done in kilo operations per second (kops/s) were as follows: Applied patches v1 kops/s v1 change v2 kops/s v2 change --------------- --------- --------- --------- --------- None 3,520 1.00X 6,242 1.00X 1-2 4,304 1.22X 8,478 1.36X 1-3 4,731 1.34X 418,142 66.99X 1-4 4,587 1.30X 438,838 70.30X With memory accounting disabled, the kmalloc/kfree rate was 1,481,291 kop/s. This test shows how significant the memory accouting overhead can be in some extreme situations. For this multithreaded test, the improvement from patch 2 mainly comes from the conditional atomic xchg of objcg->nr_charged_bytes in mod_objcg_state(). By using an unconditional xchg, the operation rates were similar to the unpatched kernel. Patch 3 elminates the single highly contended cacheline of objcg->nr_charged_bytes for cgroup v2 leading to a huge performance improvement. Cgroup v1, however, still has another highly contended cacheline in the shared page counter &memcg->kmem. So the improvement is only modest. Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as eliminating the irq_disable/irq_enable overhead seems to aggravate the cacheline contention. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/ This patch (of 4): mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 9855609b Fri Aug 07 00:21:10 MDT 2020 Roman Gushchin <guro@fb.com> mm: memcg/slab: use a single set of kmem_caches for all accounted allocations This is fairly big but mostly red patch, which makes all accounted slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup. Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be perfectly destroyed together with their root counterparts. This allows to dramatically simplify the management of non-root kmem_caches and delete a ton of code. This patch performs the following changes: 1) introduces memcg_params.memcg_cache pointer to represent the kmem_cache which will be used for all non-root allocations 2) reuses the existing memcg kmem_cache creation mechanism to create memcg kmem_cache on the first allocation attempt 3) memcg kmem_caches are named <kmemcache_name>-memcg, e.g. dentry-memcg 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache or schedule it's creation and return the root cache 5) removes almost all non-root kmem_cache management code (separate refcounter, reparenting, shrinking, etc) 6) makes slab debugfs to display root_mem_cgroup css id and never show :dead and :deact flags in the memcg_slabinfo attribute. Following patches in the series will simplify the kmem_cache creation. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1a3e1f40 Fri Aug 07 00:20:45 MDT 2020 Johannes Weiner <hannes@cmpxchg.org> mm: memcontrol: decouple reference counting from page accounting The reference counting of a memcg is currently coupled directly to how many 4k pages are charged to it. This doesn't work well with Roman's new slab controller, which maintains pools of objects and doesn't want to keep an extra balance sheet for the pages backing those objects. This unusual refcounting design (reference counts usually track pointers to an object) is only for historical reasons: memcg used to not take any css references and simply stalled offlining until all charges had been reparented and the page counters had dropped to zero. When we got rid of the reparenting requirement, the simple mechanical translation was to take a reference for every charge. More historical context can be found in commit e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page"), commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"). The new slab controller exposes the limitations in this scheme, so let's switch it to a more idiomatic reference counting model based on actual kernel pointers to the memcg: - The per-cpu stock holds a reference to the memcg its caching - User pages hold a reference for their page->mem_cgroup. Transparent huge pages will no longer acquire tail references in advance, we'll get them if needed during the split. - Kernel pages hold a reference for their page->mem_cgroup - Pages allocated in the root cgroup will acquire and release css references for simplicity. css_get() and css_put() optimize that. - The current memcg_charge_slab() already hacked around the per-charge references; this change gets rid of that as well. - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free} Roman: 1) Rebased on top of the current mm tree: added css_get() in mem_cgroup_charge(), dropped mem_cgroup_try_charge() part 2) I've reformatted commit references in the commit log to make checkpatch.pl happy. [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 9c315e4d Wed Apr 01 22:06:53 MDT 2020 Roman Gushchin <guro@fb.com> mm: memcg/slab: cache page number in memcg_(un)charge_slab() There are many places in memcg_charge_slab() and memcg_uncharge_slab() which are calculating the number of pages to charge, css references to grab etc depending on the order of the slab page. Let's simplify the code by calculating it once and caching in the local variable. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-6-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 10eaec2f Wed Apr 01 22:06:39 MDT 2020 Roman Gushchin <guro@fb.com> mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments Patch series "mm: memcg: kmem API cleanup", v2. This patchset aims to clean up the kernel memory charging API. It doesn't bring any functional changes, just removes unused arguments, renames some functions and fixes some comments. Currently it's not obvious which functions are most basic (memcg_kmem_(un)charge_memcg()) and which are based on them (memcg_kmem_(un)charge()). The patchset renames these functions and removes unused arguments: TL;DR: was: memcg_kmem_charge_memcg(page, gfp, order, memcg) memcg_kmem_uncharge_memcg(memcg, nr_pages) memcg_kmem_charge(page, gfp, order) memcg_kmem_uncharge(page, order) now: memcg_kmem_charge(memcg, gfp, nr_pages) memcg_kmem_uncharge(memcg, nr_pages) memcg_kmem_charge_page(page, gfp, order) memcg_kmem_uncharge_page(page, order) This patch (of 6): The first argument of memcg_kmem_charge_memcg() and __memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's drop it. Memcg pointer is passed as the last argument. Move it to the first place for consistency with other memcg functions, e.g. __memcg_kmem_uncharge_memcg() or try_charge(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 6cea1d56 Thu Jul 11 21:56:16 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: unify SLAB and SLUB page accounting Currently the page accounting code is duplicated in SLAB and SLUB internals. Let's move it into new (un)charge_slab_page helpers in the slab_common.c file. These helpers will be responsible for statistics (global and memcg-aware) and memcg charging. So they are replacing direct memcg_(un)charge_slab() calls. Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 6cea1d56 Thu Jul 11 21:56:16 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: unify SLAB and SLUB page accounting Currently the page accounting code is duplicated in SLAB and SLUB internals. Let's move it into new (un)charge_slab_page helpers in the slab_common.c file. These helpers will be responsible for statistics (global and memcg-aware) and memcg charging. So they are replacing direct memcg_(un)charge_slab() calls. Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c03914b7 Thu Jul 11 21:56:02 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Patch series "mm: reparent slab memory on cgroup removal", v7. # Why do we need this? We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects. The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed. Slab objects, and first of all vfs cache, is shared between cgroups, which are using the same underlying fs, and what's even more important, it's shared between multiple generations of the same workload. So if something is running periodically every time in a new cgroup (like how systemd works), we do accumulate multiple dying cgroups. Strictly speaking pagecache isn't different here, but there is a key difference: we disable protection and apply some extra pressure on LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with the disabled kernel memory accounting the number of dying cgroups stabilizes at a relatively small number (~100, depends on memory pressure and cgroup creation rate), and with kernel memory accounting it grows pretty steadily up to several thousands. Memory cgroups are quite complex and big objects (mostly due to percpu stats), so it leads to noticeable memory losses. Memory occupied by dying cgroups is measured in hundreds of megabytes. I've even seen a host with more than 100Gb of memory wasted for dying cgroups. It leads to a degradation of performance with the uptime, and generally limits the usage of cgroups. My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused a regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful. So instead of trying to find a maybe non-existing balance, let's do reparent accounted slab caches to the parent cgroup on cgroup removal. # Implementation approach There is however a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are in shrinker lists, but not all. Introducing of a new list is really not an option. But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages. It's actually simpler and cheaper, but requires some underlying changes: 1) Make kmem_caches to hold a single reference to the memory cgroup, instead of a separate reference per every slab page. 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use page->kmem_cache->memcg indirection instead. It's used only on slab page release, so performance overhead shouldn't be a big issue. 3) Introduce a refcounter for non-root slab caches. It's required to be able to destroy kmem_caches when they become empty and release the associated memory cgroup. There is a bonus: currently we release all memcg kmem_caches all together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free. Some additional implementation details are provided in corresponding commit messages. # Results Below is the average number of dying cgroups on two groups of our production hosts. They do run some sort of web frontend workload, the memory pressure is moderate. As we can see, with the kernel memory reparenting the number stabilizes in 60s range; however with the original version it grows almost linearly and doesn't show any signs of plateauing. The difference in slab and percpu usage between patched and unpatched versions also grows linearly. In 7 days it exceeded 200Mb. day 0 1 2 3 4 5 6 7 original 56 362 628 752 1070 1250 1490 1560 patched 23 46 51 55 60 57 67 69 mem diff(Mb) 22 74 123 152 164 182 214 241 # Links [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error") [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects") [5]: https://lkml.org/lkml/2019/1/28/1865 [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2 This patch (of 10): Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params(). Once kmem_cache will hold a reference to the memory cgroup, it will simplify the refcounting. For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe. Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c03914b7 Thu Jul 11 21:56:02 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Patch series "mm: reparent slab memory on cgroup removal", v7. # Why do we need this? We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects. The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed. Slab objects, and first of all vfs cache, is shared between cgroups, which are using the same underlying fs, and what's even more important, it's shared between multiple generations of the same workload. So if something is running periodically every time in a new cgroup (like how systemd works), we do accumulate multiple dying cgroups. Strictly speaking pagecache isn't different here, but there is a key difference: we disable protection and apply some extra pressure on LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with the disabled kernel memory accounting the number of dying cgroups stabilizes at a relatively small number (~100, depends on memory pressure and cgroup creation rate), and with kernel memory accounting it grows pretty steadily up to several thousands. Memory cgroups are quite complex and big objects (mostly due to percpu stats), so it leads to noticeable memory losses. Memory occupied by dying cgroups is measured in hundreds of megabytes. I've even seen a host with more than 100Gb of memory wasted for dying cgroups. It leads to a degradation of performance with the uptime, and generally limits the usage of cgroups. My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused a regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful. So instead of trying to find a maybe non-existing balance, let's do reparent accounted slab caches to the parent cgroup on cgroup removal. # Implementation approach There is however a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are in shrinker lists, but not all. Introducing of a new list is really not an option. But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages. It's actually simpler and cheaper, but requires some underlying changes: 1) Make kmem_caches to hold a single reference to the memory cgroup, instead of a separate reference per every slab page. 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use page->kmem_cache->memcg indirection instead. It's used only on slab page release, so performance overhead shouldn't be a big issue. 3) Introduce a refcounter for non-root slab caches. It's required to be able to destroy kmem_caches when they become empty and release the associated memory cgroup. There is a bonus: currently we release all memcg kmem_caches all together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free. Some additional implementation details are provided in corresponding commit messages. # Results Below is the average number of dying cgroups on two groups of our production hosts. They do run some sort of web frontend workload, the memory pressure is moderate. As we can see, with the kernel memory reparenting the number stabilizes in 60s range; however with the original version it grows almost linearly and doesn't show any signs of plateauing. The difference in slab and percpu usage between patched and unpatched versions also grows linearly. In 7 days it exceeded 200Mb. day 0 1 2 3 4 5 6 7 original 56 362 628 752 1070 1250 1490 1560 patched 23 46 51 55 60 57 67 69 mem diff(Mb) 22 74 123 152 164 182 214 241 # Links [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error") [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects") [5]: https://lkml.org/lkml/2019/1/28/1865 [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2 This patch (of 10): Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params(). Once kmem_cache will hold a reference to the memory cgroup, it will simplify the refcounting. For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe. Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c03914b7 Thu Jul 11 21:56:02 MDT 2019 Roman Gushchin <guro@fb.com> mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Patch series "mm: reparent slab memory on cgroup removal", v7. # Why do we need this? We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects. The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed. Slab objects, and first of all vfs cache, is shared between cgroups, which are using the same underlying fs, and what's even more important, it's shared between multiple generations of the same workload. So if something is running periodically every time in a new cgroup (like how systemd works), we do accumulate multiple dying cgroups. Strictly speaking pagecache isn't different here, but there is a key difference: we disable protection and apply some extra pressure on LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with the disabled kernel memory accounting the number of dying cgroups stabilizes at a relatively small number (~100, depends on memory pressure and cgroup creation rate), and with kernel memory accounting it grows pretty steadily up to several thousands. Memory cgroups are quite complex and big objects (mostly due to percpu stats), so it leads to noticeable memory losses. Memory occupied by dying cgroups is measured in hundreds of megabytes. I've even seen a host with more than 100Gb of memory wasted for dying cgroups. It leads to a degradation of performance with the uptime, and generally limits the usage of cgroups. My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused a regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful. So instead of trying to find a maybe non-existing balance, let's do reparent accounted slab caches to the parent cgroup on cgroup removal. # Implementation approach There is however a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are in shrinker lists, but not all. Introducing of a new list is really not an option. But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages. It's actually simpler and cheaper, but requires some underlying changes: 1) Make kmem_caches to hold a single reference to the memory cgroup, instead of a separate reference per every slab page. 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use page->kmem_cache->memcg indirection instead. It's used only on slab page release, so performance overhead shouldn't be a big issue. 3) Introduce a refcounter for non-root slab caches. It's required to be able to destroy kmem_caches when they become empty and release the associated memory cgroup. There is a bonus: currently we release all memcg kmem_caches all together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free. Some additional implementation details are provided in corresponding commit messages. # Results Below is the average number of dying cgroups on two groups of our production hosts. They do run some sort of web frontend workload, the memory pressure is moderate. As we can see, with the kernel memory reparenting the number stabilizes in 60s range; however with the original version it grows almost linearly and doesn't show any signs of plateauing. The difference in slab and percpu usage between patched and unpatched versions also grows linearly. In 7 days it exceeded 200Mb. day 0 1 2 3 4 5 6 7 original 56 362 628 752 1070 1250 1490 1560 patched 23 46 51 55 60 57 67 69 mem diff(Mb) 22 74 123 152 164 182 214 241 # Links [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error") [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects") [5]: https://lkml.org/lkml/2019/1/28/1865 [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2 This patch (of 10): Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params(). Once kmem_cache will hold a reference to the memory cgroup, it will simplify the refcounting. For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe. Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
H A D | slab_common.c | diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6e284c55 Fri Aug 04 21:17:25 MDT 2023 Zhen Lei <thunder.leizhen@huawei.com> mm: Remove kmem_valid_obj() Function kmem_dump_obj() will splat if passed a pointer to a non-slab object. So nothing calls it directly, instead calling kmem_valid_obj() first to determine whether the passed pointer to a valid slab object. This means that merging kmem_valid_obj() into kmem_dump_obj() will make the code more concise. Therefore, convert kmem_dump_obj() to work the same way as vmalloc_dump_obj(), removing the need for the kmem_dump_obj() caller to check kmem_valid_obj(). After this, there are no remaining calls to kmem_valid_obj() anymore, and it can be safely removed. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> diff eb4940d4 Fri Nov 04 06:57:11 MDT 2022 Vlastimil Babka <vbabka@suse.cz> mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace() For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where the allocation size is build-time constant) to save a function call, by inlining kmalloc_trace() to a kmem_cache_alloc() call. However since commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") this path now fails to pass the original request size to be eventually recorded (for kmalloc caches with debugging enabled). We could adjust the code to call __kmem_cache_alloc_node() as the CONFIG_TRACING variant, but that would as a result inline a call with 5 parameters, bloating the kmalloc() call sites. The cost of extra function call (to kmalloc_trace()) seems like a lesser evil. It also appears that the !CONFIG_TRACING variant is incompatible with upcoming hardening efforts [1] so it's easier if we just remove it now. Kernels with no tracing are rare these days and the benefit is dubious anyway. [1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/ Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6edf2576 Tue Sep 13 00:54:20 MDT 2022 Feng Tang <feng.tang@intel.com> mm/slub: enable debugging memory wasting of kmalloc kmalloc's API family is critical for mm, with one nature that it will round up the request size to a fixed one (mostly power of 2). Say when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes could be allocated, so in worst case, there is around 50% memory space waste. The wastage is not a big issue for requests that get allocated/freed quickly, but may cause problems with objects that have longer life time. We've met a kernel boot OOM panic (v5.10), and from the dumped slab info: [ 26.062145] kmalloc-2k 814056KB 814056KB From debug we found there are huge number of 'struct iova_magazine', whose size is 1032 bytes (1024 + 8), so each allocation will waste 1016 bytes. Though the issue was solved by giving the right (bigger) size of RAM, it is still nice to optimize the size (either use a kmalloc friendly size or create a dedicated slab for it). And from lkml archive, there was another crash kernel OOM case [1] back in 2019, which seems to be related with the similar slab waste situation, as the log is similar: [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 ... [ 4.857565] kmalloc-2048 59164KB 59164KB The crash kernel only has 256M memory, and 59M is pretty big here. (Note: the related code has been changed and optimised in recent kernel [2], these logs are just picked to demo the problem, also a patch changing its size to 1024 bytes has been merged) So add an way to track each kmalloc's memory waste info, and leverage the existing SLUB debug framework (specifically SLUB_STORE_USER) to show its call stack of original allocation, so that user can evaluate the waste situation, identify some hot spots and optimize accordingly, for a better utilization of memory. The waste info is integrated into existing interface: '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of 'kmalloc-4k' after boot is: 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1 __kmem_cache_alloc_node+0x11f/0x4e0 __kmalloc_node+0x4e/0x140 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe] ixgbe_probe+0x165f/0x1d20 [ixgbe] local_pci_probe+0x78/0xc0 work_for_cpu_fn+0x26/0x40 ... which means in 'kmalloc-4k' slab, there are 126 requests of 2240 bytes which got a 4KB space (wasting 1856 bytes each and 233856 bytes in total), from ixgbe_alloc_q_vector(). And when system starts some real workload like multiple docker instances, there could are more severe waste. [1]. https://lkml.org/lkml/2019/8/12/266 [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/ [Thanks Hyeonggon for pointing out several bugs about sorting/format] [Thanks Vlastimil for suggesting way to reduce memory usage of orig_size and keep it only for kmalloc objects] Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: John Garry <john.garry@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 96403bfe Wed Feb 24 01:04:26 MST 2021 Muchun Song <songmuchun@bytedance.com> mm: memcontrol: fix slub memory accounting SLUB currently account kmalloc() and kmalloc_node() allocations larger than order-1 page per-node. But it forget to update the per-memcg vmstats. So it can lead to inaccurate statistics of "slab_unreclaimable" which is from memory.stat. Fix it by using mod_lruvec_page_state instead of mod_node_page_state. Link: https://lkml.kernel.org/r/20210223092423.42420-1-songmuchun@bytedance.com Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 9855609b Fri Aug 07 00:21:10 MDT 2020 Roman Gushchin <guro@fb.com> mm: memcg/slab: use a single set of kmem_caches for all accounted allocations This is fairly big but mostly red patch, which makes all accounted slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup. Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be perfectly destroyed together with their root counterparts. This allows to dramatically simplify the management of non-root kmem_caches and delete a ton of code. This patch performs the following changes: 1) introduces memcg_params.memcg_cache pointer to represent the kmem_cache which will be used for all non-root allocations 2) reuses the existing memcg kmem_cache creation mechanism to create memcg kmem_cache on the first allocation attempt 3) memcg kmem_caches are named <kmemcache_name>-memcg, e.g. dentry-memcg 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache or schedule it's creation and return the root cache 5) removes almost all non-root kmem_cache management code (separate refcounter, reparenting, shrinking, etc) 6) makes slab debugfs to display root_mem_cgroup css id and never show :dead and :deact flags in the memcg_slabinfo attribute. Following patches in the series will simplify the kmem_cache creation. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 49f2d241 Mon Jun 01 22:45:43 MDT 2020 Vlastimil Babka <vbabka@suse.cz> usercopy: mark dma-kmalloc caches as usercopy caches We have seen a "usercopy: Kernel memory overwrite attempt detected to SLUB object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as IUCV uses kmalloc() with __GFP_DMA because of memory address restrictions. The issue has been discussed [2] and it has been noted that if all the kmalloc caches are marked as usercopy, there's little reason not to mark dma-kmalloc caches too. The 'dma' part merely means that __GFP_DMA is used to restrict memory address range. As Jann Horn put it [3]: "I think dma-kmalloc slabs should be handled the same way as normal kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is just normal kernel memory - even if it might later be used for DMA -, and it should be perfectly fine to copy_from_user() into such allocations at that point, and to copy_to_user() out of them at the end. If you look at the places where such allocations are created, you can see things like kmemdup(), memcpy() and so on - all normal operations that shouldn't conceptually be different from usercopy in any relevant way." Thus this patch marks the dma-kmalloc-* caches as usercopy. [1] https://bugzilla.suse.com/show_bug.cgi?id=1156053 [2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c68@suse.cz/ [3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYAVk8=XYaKj4_+6e6Hg@mail.gmail.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Jiri Slaby <jslaby@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Julian Wiedmann <jwi@linux.ibm.com> Cc: Ursula Braun <ubraun@linux.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Windsor <dave@nullcore.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: Luis de Bethencourt <luisbg@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Garrett <mjg59@google.com> Cc: Michal Kubecek <mkubecek@suse.cz> Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
H A D | slub.c | diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 011568eb Tue Feb 27 20:04:08 MST 2024 Xiaolei Wang <xiaolei.wang@windriver.com> mm/slab: Fix a kmemleak in kmem_cache_destroy() For earlier kmem cache creation, slab_sysfs_init() has not been called. Consequently, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing kmem_cache when slab_state isn't FULL. This will fixes the memory leaks like following: unreferenced object 0xffff0000c2d87080 (size 128): comm "swapper/0", pid 1, jiffies 4294893428 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b .....N......kkkk ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff.....H..... backtrace (crc 8819d0f6): [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4 [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8 [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28 [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4 [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0 [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8 [<ffff8000831728d8>] kernel_init+0x24/0x144 [<ffff800080018e98>] ret_from_fork+0x10/0x20 Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6f3dd2c3 Mon Aug 07 12:50:44 MDT 2023 Vlastimil Babka <vbabka@suse.cz> mm/slub: fix bulk alloc and free stats The SLUB sysfs stats enabled CONFIG_SLUB_STATS have two deficiencies identified wrt bulk alloc/free operations: - Bulk allocations from cpu freelist are not counted. Add the ALLOC_FASTPATH counter there. - Bulk fastpath freeing will count a list of multiple objects with a single FREE_FASTPATH inc. Add a stat_add() variant to count them all. Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 90ce872c Mon Nov 21 03:44:54 MST 2022 Vlastimil Babka <vbabka@suse.cz> mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering the default slub_max_order we can make slab allocations use smaller pages. However depending on object sizes, order-0 might not be the best due to increased fragmentation. When testing on a 8MB RAM k210 system by Damien Le Moal [1], slub_max_order=1 had the best results, so use that as the default for CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> diff d65360f2 Mon Sep 26 08:20:41 MDT 2022 Chao Yu <chao.yu@oppo.com> mm/slub: clean up create_unique_id() As Christophe JAILLET suggested [1] In create_unique_id(), "looks that ID_STR_LENGTH could even be reduced to 32 or 16. The 2nd BUG_ON at the end of the function could certainly be just removed as well or remplaced by a: if (p > name + ID_STR_LENGTH - 1) { kfree(name); return -E<something>; } " According to above suggestion, let's do below cleanups: 1. reduce ID_STR_LENGTH to 32, as the buffer size should be enough; 2. use WARN_ON instead of BUG_ON() and return error if check condition is true; 3. use snprintf instead of sprintf to avoid overflow. [1] https://lore.kernel.org/linux-mm/2025305d-16db-abdf-6cd3-1fb93371c2b4@wanadoo.fr/ Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6edf2576 Tue Sep 13 00:54:20 MDT 2022 Feng Tang <feng.tang@intel.com> mm/slub: enable debugging memory wasting of kmalloc kmalloc's API family is critical for mm, with one nature that it will round up the request size to a fixed one (mostly power of 2). Say when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes could be allocated, so in worst case, there is around 50% memory space waste. The wastage is not a big issue for requests that get allocated/freed quickly, but may cause problems with objects that have longer life time. We've met a kernel boot OOM panic (v5.10), and from the dumped slab info: [ 26.062145] kmalloc-2k 814056KB 814056KB From debug we found there are huge number of 'struct iova_magazine', whose size is 1032 bytes (1024 + 8), so each allocation will waste 1016 bytes. Though the issue was solved by giving the right (bigger) size of RAM, it is still nice to optimize the size (either use a kmalloc friendly size or create a dedicated slab for it). And from lkml archive, there was another crash kernel OOM case [1] back in 2019, which seems to be related with the similar slab waste situation, as the log is similar: [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 ... [ 4.857565] kmalloc-2048 59164KB 59164KB The crash kernel only has 256M memory, and 59M is pretty big here. (Note: the related code has been changed and optimised in recent kernel [2], these logs are just picked to demo the problem, also a patch changing its size to 1024 bytes has been merged) So add an way to track each kmalloc's memory waste info, and leverage the existing SLUB debug framework (specifically SLUB_STORE_USER) to show its call stack of original allocation, so that user can evaluate the waste situation, identify some hot spots and optimize accordingly, for a better utilization of memory. The waste info is integrated into existing interface: '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of 'kmalloc-4k' after boot is: 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1 __kmem_cache_alloc_node+0x11f/0x4e0 __kmalloc_node+0x4e/0x140 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe] ixgbe_probe+0x165f/0x1d20 [ixgbe] local_pci_probe+0x78/0xc0 work_for_cpu_fn+0x26/0x40 ... which means in 'kmalloc-4k' slab, there are 126 requests of 2240 bytes which got a 4KB space (wasting 1856 bytes each and 233856 bytes in total), from ixgbe_alloc_q_vector(). And when system starts some real workload like multiple docker instances, there could are more severe waste. [1]. https://lkml.org/lkml/2019/8/12/266 [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/ [Thanks Hyeonggon for pointing out several bugs about sorting/format] [Thanks Vlastimil for suggesting way to reduce memory usage of orig_size and keep it only for kmalloc objects] Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: John Garry <john.garry@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6b6efe23 Sat Apr 09 08:42:39 MDT 2022 JaeSang Yoo <js.yoo.5b@gmail.com> mm/slub: remove meaningless node check in ___slab_alloc() node_match() with node=NUMA_NO_NODE always returns 1. Duplicate check by goto statement is meaningless. Remove it. Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220409144239.2649257-1-jsyoo5b@gmail.com diff 6d3a16d0 Mon Mar 07 00:40:56 MST 2022 Hyeonggon Yoo <42.hyeyoo@gmail.com> mm/slub: refactor deactivate_slab() Simplify deactivate_slab() by unlocking n->list_lock and retrying cmpxchg_double() when cmpxchg_double() fails, and perform add_{partial,full} only when it succeed. Releasing and taking n->list_lock again here is not harmful as SLUB avoids deactivating slabs as much as possible. [ vbabka@suse.cz: perform add_{partial,full} when cmpxchg_double() succeed. count deactivating full slabs even if debugging flag is not set. ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220307074057.902222-3-42.hyeyoo@gmail.com |
/linux-master/include/linux/ | ||
H A D | slab.h | diff 6e284c55 Fri Aug 04 21:17:25 MDT 2023 Zhen Lei <thunder.leizhen@huawei.com> mm: Remove kmem_valid_obj() Function kmem_dump_obj() will splat if passed a pointer to a non-slab object. So nothing calls it directly, instead calling kmem_valid_obj() first to determine whether the passed pointer to a valid slab object. This means that merging kmem_valid_obj() into kmem_dump_obj() will make the code more concise. Therefore, convert kmem_dump_obj() to work the same way as vmalloc_dump_obj(), removing the need for the kmem_dump_obj() caller to check kmem_valid_obj(). After this, there are no remaining calls to kmem_valid_obj() anymore, and it can be safely removed. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> diff 6fa57d78 Thu Nov 17 20:51:59 MST 2022 Kees Cook <keescook@chromium.org> slab: Remove special-casing of const 0 size allocations Passing a constant-0 size allocation into kmalloc() or kmalloc_node() does not need to be a fast-path operation, so the static return value can be removed entirely. This makes sure that all paths through the inlines result in a full extern function call, where __alloc_size() hints will actually be seen[1] by GCC. (A constant return value of 0 means the "0" allocation size won't be propagated by the inline.) [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503 Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff eb4940d4 Fri Nov 04 06:57:11 MDT 2022 Vlastimil Babka <vbabka@suse.cz> mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace() For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where the allocation size is build-time constant) to save a function call, by inlining kmalloc_trace() to a kmem_cache_alloc() call. However since commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") this path now fails to pass the original request size to be eventually recorded (for kmalloc caches with debugging enabled). We could adjust the code to call __kmem_cache_alloc_node() as the CONFIG_TRACING variant, but that would as a result inline a call with 5 parameters, bloating the kmalloc() call sites. The cost of extra function call (to kmalloc_trace()) seems like a lesser evil. It also appears that the !CONFIG_TRACING variant is incompatible with upcoming hardening efforts [1] so it's easier if we just remove it now. Kernels with no tracing are rare these days and the benefit is dubious anyway. [1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/ Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 6edf2576 Tue Sep 13 00:54:20 MDT 2022 Feng Tang <feng.tang@intel.com> mm/slub: enable debugging memory wasting of kmalloc kmalloc's API family is critical for mm, with one nature that it will round up the request size to a fixed one (mostly power of 2). Say when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes could be allocated, so in worst case, there is around 50% memory space waste. The wastage is not a big issue for requests that get allocated/freed quickly, but may cause problems with objects that have longer life time. We've met a kernel boot OOM panic (v5.10), and from the dumped slab info: [ 26.062145] kmalloc-2k 814056KB 814056KB From debug we found there are huge number of 'struct iova_magazine', whose size is 1032 bytes (1024 + 8), so each allocation will waste 1016 bytes. Though the issue was solved by giving the right (bigger) size of RAM, it is still nice to optimize the size (either use a kmalloc friendly size or create a dedicated slab for it). And from lkml archive, there was another crash kernel OOM case [1] back in 2019, which seems to be related with the similar slab waste situation, as the log is similar: [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 ... [ 4.857565] kmalloc-2048 59164KB 59164KB The crash kernel only has 256M memory, and 59M is pretty big here. (Note: the related code has been changed and optimised in recent kernel [2], these logs are just picked to demo the problem, also a patch changing its size to 1024 bytes has been merged) So add an way to track each kmalloc's memory waste info, and leverage the existing SLUB debug framework (specifically SLUB_STORE_USER) to show its call stack of original allocation, so that user can evaluate the waste situation, identify some hot spots and optimize accordingly, for a better utilization of memory. The waste info is integrated into existing interface: '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of 'kmalloc-4k' after boot is: 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1 __kmem_cache_alloc_node+0x11f/0x4e0 __kmalloc_node+0x4e/0x140 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe] ixgbe_probe+0x165f/0x1d20 [ixgbe] local_pci_probe+0x78/0xc0 work_for_cpu_fn+0x26/0x40 ... which means in 'kmalloc-4k' slab, there are 126 requests of 2240 bytes which got a 4KB space (wasting 1856 bytes each and 233856 bytes in total), from ixgbe_alloc_q_vector(). And when system starts some real workload like multiple docker instances, there could are more severe waste. [1]. https://lkml.org/lkml/2019/8/12/266 [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/ [Thanks Hyeonggon for pointing out several bugs about sorting/format] [Thanks Vlastimil for suggesting way to reduce memory usage of orig_size and keep it only for kmalloc objects] Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: John Garry <john.garry@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> diff 56bcf40f Fri Nov 05 14:36:31 MDT 2021 Kees Cook <keescook@chromium.org> mm/kvmalloc: add __alloc_size attributes for better bounds checking As already done in GrapheneOS, add the __alloc_size attribute for regular kvmalloc interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/20210930222704.2631604-6-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 9855609b Fri Aug 07 00:21:10 MDT 2020 Roman Gushchin <guro@fb.com> mm: memcg/slab: use a single set of kmem_caches for all accounted allocations This is fairly big but mostly red patch, which makes all accounted slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup. Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be perfectly destroyed together with their root counterparts. This allows to dramatically simplify the management of non-root kmem_caches and delete a ton of code. This patch performs the following changes: 1) introduces memcg_params.memcg_cache pointer to represent the kmem_cache which will be used for all non-root allocations 2) reuses the existing memcg kmem_cache creation mechanism to create memcg kmem_cache on the first allocation attempt 3) memcg kmem_caches are named <kmemcache_name>-memcg, e.g. dentry-memcg 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache or schedule it's creation and return the root cache 5) removes almost all non-root kmem_cache management code (separate refcounter, reparenting, shrinking, etc) 6) makes slab debugfs to display root_mem_cgroup css id and never show :dead and :deact flags in the memcg_slabinfo attribute. Following patches in the series will simplify the kmem_cache creation. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 6d6ea1e9 Thu Mar 28 21:43:42 MDT 2019 Nicolas Boichat <drinkcat@chromium.org> mm: add support for kmem caches in DMA32 zone Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables", v6. This is a followup to the discussion in [1], [2]. IOMMUs using ARMv7 short-descriptor format require page tables (level 1 and 2) to be allocated within the first 4GB of RAM, even on 64-bit systems. For L1 tables that are bigger than a page, we can just use __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still use GFP_DMA). For L2 tables that only take 1KB, it would be a waste to allocate a full page, so we considered 3 approaches: 1. This series, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. [3] This series is the most memory-efficient approach. stable@ note: We confirmed that this is a regression, and IOMMU errors happen on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek platforms (and maybe others?). [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html [3] https://patchwork.codeaurora.org/patch/671639/ This patch (of 3): IOMMUs using ARMv7 short-descriptor format require page tables to be allocated within the first 4GB of RAM, even on 64-bit systems. On arm64, this is done by passing GFP_DMA32 flag to memory allocation functions. For IOMMU L2 tables that only take 1KB, it would be a waste to allocate a full page using get_free_pages, so we considered 3 approaches: 1. This patch, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. This change makes it possible to create a custom cache in DMA32 zone using kmem_cache_create, then allocate memory using kmem_cache_alloc. We do not create a DMA32 kmalloc cache array, as there are currently no users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK. This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32 kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and unnecessary). Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 6d6ea1e9 Thu Mar 28 21:43:42 MDT 2019 Nicolas Boichat <drinkcat@chromium.org> mm: add support for kmem caches in DMA32 zone Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables", v6. This is a followup to the discussion in [1], [2]. IOMMUs using ARMv7 short-descriptor format require page tables (level 1 and 2) to be allocated within the first 4GB of RAM, even on 64-bit systems. For L1 tables that are bigger than a page, we can just use __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still use GFP_DMA). For L2 tables that only take 1KB, it would be a waste to allocate a full page, so we considered 3 approaches: 1. This series, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. [3] This series is the most memory-efficient approach. stable@ note: We confirmed that this is a regression, and IOMMU errors happen on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek platforms (and maybe others?). [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html [3] https://patchwork.codeaurora.org/patch/671639/ This patch (of 3): IOMMUs using ARMv7 short-descriptor format require page tables to be allocated within the first 4GB of RAM, even on 64-bit systems. On arm64, this is done by passing GFP_DMA32 flag to memory allocation functions. For IOMMU L2 tables that only take 1KB, it would be a waste to allocate a full page using get_free_pages, so we considered 3 approaches: 1. This patch, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. This change makes it possible to create a custom cache in DMA32 zone using kmem_cache_create, then allocate memory using kmem_cache_alloc. We do not create a DMA32 kmalloc cache array, as there are currently no users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK. This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32 kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and unnecessary). Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 0116523c Fri Dec 28 01:29:37 MST 2018 Andrey Konovalov <andreyknvl@google.com> kasan, mm: change hooks signatures Patch series "kasan: add software tag-based mode for arm64", v13. This patchset adds a new software tag-based mode to KASAN [1]. (Initially this mode was called KHWASAN, but it got renamed, see the naming rationale at the end of this section). The plan is to implement HWASan [2] for the kernel with the incentive, that it's going to have comparable to KASAN performance, but in the same time consume much less memory, trading that off for somewhat imprecise bug detection and being supported only for arm64. The underlying ideas of the approach used by software tag-based KASAN are: 1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store pointer tags in the top byte of each kernel pointer. 2. Using shadow memory, we can store memory tags for each chunk of kernel memory. 3. On each memory allocation, we can generate a random tag, embed it into the returned pointer and set the memory tags that correspond to this chunk of memory to the same value. 4. By using compiler instrumentation, before each memory access we can add a check that the pointer tag matches the tag of the memory that is being accessed. 5. On a tag mismatch we report an error. With this patchset the existing KASAN mode gets renamed to generic KASAN, with the word "generic" meaning that the implementation can be supported by any architecture as it is purely software. The new mode this patchset adds is called software tag-based KASAN. The word "tag-based" refers to the fact that this mode uses tags embedded into the top byte of kernel pointers and the TBI arm64 CPU feature that allows to dereference such pointers. The word "software" here means that shadow memory manipulation and tag checking on pointer dereference is done in software. As it is the only tag-based implementation right now, "software tag-based" KASAN is sometimes referred to as simply "tag-based" in this patchset. A potential expansion of this mode is a hardware tag-based mode, which would use hardware memory tagging support (announced by Arm [3]) instead of compiler instrumentation and manual shadow memory manipulation. Same as generic KASAN, software tag-based KASAN is strictly a debugging feature. [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a ====== Rationale On mobile devices generic KASAN's memory usage is significant problem. One of the main reasons to have tag-based KASAN is to be able to perform a similar set of checks as the generic one does, but with lower memory requirements. Comment from Vishwath Mohan <vishwath@google.com>: I don't have data on-hand, but anecdotally both ASAN and KASAN have proven problematic to enable for environments that don't tolerate the increased memory pressure well. This includes (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go, (c) Connected components like Pixel's visual core [1]. These are both places I'd love to have a low(er) memory footprint option at my disposal. Comment from Evgenii Stepanov <eugenis@google.com>: Looking at a live Android device under load, slab (according to /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's overhead of 2x - 3x on top of it is not insignificant. Not having this overhead enables near-production use - ex. running KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do not reproduce in test configuration. These are the ones that often cost the most engineering time to track down. CPU overhead is bad, but generally tolerable. RAM is critical, in our experience. Once it gets low enough, OOM-killer makes your life miserable. [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/ ====== Technical details Software tag-based KASAN mode is implemented in a very similar way to the generic one. This patchset essentially does the following: 1. TCR_TBI1 is set to enable Top Byte Ignore. 2. Shadow memory is used (with a different scale, 1:16, so each shadow byte corresponds to 16 bytes of kernel memory) to store memory tags. 3. All slab objects are aligned to shadow scale, which is 16 bytes. 4. All pointers returned from the slab allocator are tagged with a random tag and the corresponding shadow memory is poisoned with the same value. 5. Compiler instrumentation is used to insert tag checks. Either by calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE flags are reused). 6. When a tag mismatch is detected in callback instrumentation mode KASAN simply prints a bug report. In case of inline instrumentation, clang inserts a brk instruction, and KASAN has it's own brk handler, which reports the bug. 7. The memory in between slab objects is marked with a reserved tag, and acts as a redzone. 8. When a slab object is freed it's marked with a reserved tag. Bug detection is imprecise for two reasons: 1. We won't catch some small out-of-bounds accesses, that fall into the same shadow cell, as the last byte of a slab object. 2. We only have 1 byte to store tags, which means we have a 1/256 probability of a tag match for an incorrect access (actually even slightly less due to reserved tag values). Despite that there's a particular type of bugs that tag-based KASAN can detect compared to generic KASAN: use-after-free after the object has been allocated by someone else. ====== Testing Some kernel developers voiced a concern that changing the top byte of kernel pointers may lead to subtle bugs that are difficult to discover. To address this concern deliberate testing has been performed. It doesn't seem feasible to do some kind of static checking to find potential issues with pointer tagging, so a dynamic approach was taken. All pointer comparisons/subtractions have been instrumented in an LLVM compiler pass and a kernel module that would print a bug report whenever two pointers with different tags are being compared/subtracted (ignoring comparisons with NULL pointers and with pointers obtained by casting an error code to a pointer type) has been used. Then the kernel has been booted in QEMU and on an Odroid C2 board and syzkaller has been run. This yielded the following results. The two places that look interesting are: is_vmalloc_addr in include/linux/mm.h is_kernel_rodata in mm/util.c Here we compare a pointer with some fixed untagged values to make sure that the pointer lies in a particular part of the kernel address space. Since tag-based KASAN doesn't add tags to pointers that belong to rodata or vmalloc regions, this should work as is. To make sure debug checks to those two functions that check that the result doesn't change whether we operate on pointers with or without untagging has been added. A few other cases that don't look that interesting: Comparing pointers to achieve unique sorting order of pointee objects (e.g. sorting locks addresses before performing a double lock): tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c pipe_double_lock in fs/pipe.c unix_state_double_lock in net/unix/af_unix.c lock_two_nondirectories in fs/inode.c mutex_lock_double in kernel/events/core.c ep_cmp_ffd in fs/eventpoll.c fsnotify_compare_groups fs/notify/mark.c Nothing needs to be done here, since the tags embedded into pointers don't change, so the sorting order would still be unique. Checks that a pointer belongs to some particular allocation: is_sibling_entry in lib/radix-tree.c object_is_on_stack in include/linux/sched/task_stack.h Nothing needs to be done here either, since two pointers can only belong to the same allocation if they have the same tag. Overall, since the kernel boots and works, there are no critical bugs. As for the rest, the traditional kernel testing way (use until fails) is the only one that looks feasible. Another point here is that tag-based KASAN is available under a separate config option that needs to be deliberately enabled. Even though it might be used in a "near-production" environment to find bugs that are not found during fuzzing or running tests, it is still a debug tool. ====== Benchmarks The following numbers were collected on Odroid C2 board. Both generic and tag-based KASAN were used in inline instrumentation mode. Boot time [1]: * ~1.7 sec for clean kernel * ~5.0 sec for generic KASAN * ~5.0 sec for tag-based KASAN Network performance [2]: * 8.33 Gbits/sec for clean kernel * 3.17 Gbits/sec for generic KASAN * 2.85 Gbits/sec for tag-based KASAN Slab memory usage after boot [3]: * ~40 kb for clean kernel * ~105 kb (~260% overhead) for generic KASAN * ~47 kb (~20% overhead) for tag-based KASAN KASAN memory overhead consists of three main parts: 1. Increased slab memory usage due to redzones. 2. Shadow memory (the whole reserved once during boot). 3. Quaratine (grows gradually until some preset limit; the more the limit, the more the chance to detect a use-after-free). Comparing tag-based vs generic KASAN for each of these points: 1. 20% vs 260% overhead. 2. 1/16th vs 1/8th of physical memory. 3. Tag-based KASAN doesn't require quarantine. [1] Time before the ext4 driver is initialized. [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`. [3] Measured as `cat /proc/meminfo | grep Slab`. ====== Some notes A few notes: 1. The patchset can be found here: https://github.com/xairy/kasan-prototype/tree/khwasan 2. Building requires a recent Clang version (7.0.0 or later). 3. Stack instrumentation is not supported yet and will be added later. This patch (of 25): Tag-based KASAN changes the value of the top byte of pointers returned from the kernel allocation functions (such as kmalloc). This patch updates KASAN hooks signatures and their usage in SLAB and SLUB code to reflect that. Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff cc252eae Fri Oct 26 16:05:34 MDT 2018 Vlastimil Babka <vbabka@suse.cz> mm, slab: combine kmalloc_caches and kmalloc_dma_caches Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64byte regular allocations with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming after all dcache names, and the process can thus "steal" the whole page with single 64byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include removed branch for detecting __GFP_DMA kmalloc(), and shortening kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is how /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently mainain separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow to add kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff cc252eae Fri Oct 26 16:05:34 MDT 2018 Vlastimil Babka <vbabka@suse.cz> mm, slab: combine kmalloc_caches and kmalloc_dma_caches Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64byte regular allocations with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming after all dcache names, and the process can thus "steal" the whole page with single 64byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include removed branch for detecting __GFP_DMA kmalloc(), and shortening kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is how /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently mainain separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow to add kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Completed in 843 milliseconds