#
17cce771 |
|
01-Mar-2024 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab: remove memcg_from_slab_obj() This empty wrapper exists only for !CONFIG_MEMCG_KMEM and seems to have never been used. It is probably a leftover from the development of a series. Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
3dd549a5 |
|
22-Feb-2024 |
Chengming Zhou <chengming.zhou@linux.dev> |
mm, slab: remove the corner case of inc_slabs_node() We already call inc_slabs_node() after kmem_cache_node->node[node] is initialized in early_kmem_cache_node_alloc(), so this special case in inc_slabs_node() can be removed. inc_slabs_node() then no longer needs to consider whether the kmem_cache_node exists. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
011568eb |
|
27-Feb-2024 |
Xiaolei Wang <xiaolei.wang@windriver.com> |
mm/slab: Fix a kmemleak in kmem_cache_destroy() For kmem caches created early, slab_sysfs_init() has not been called yet. Consequently, kmem_cache_destroy() cannot use kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing the kmem_cache when slab_state isn't FULL. This fixes memory leaks like the following:

  unreferenced object 0xffff0000c2d87080 (size 128):
    comm "swapper/0", pid 1, jiffies 4294893428
    hex dump (first 32 bytes):
      00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b  .....N......kkkk
      ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff  ..........H.....
    backtrace (crc 8819d0f6):
      [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4
      [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8
      [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c
      [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc
      [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28
      [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4
      [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0
      [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c
      [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac
      [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8
      [<ffff8000831728d8>] kernel_init+0x24/0x144
      [<ffff800080018e98>] ret_from_fork+0x10/0x20

Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
cc61eb85 |
|
23-Feb-2024 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab: use an enum to define SLAB_ cache creation flags The values of SLAB_ cache creation flags are defined by hand, which is tedious and error-prone. Use an enum to assign the bit number and a __SLAB_FLAG_BIT() macro to #define the final flags. This renumbers the flag values, which is OK as they are only used internally. Also define a __SLAB_FLAG_UNUSED macro to assign value to flags disabled by their respective config options in a unified and sparse-friendly way. Reviewed-and-tested-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
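The enum-based flag definition described in this commit can be illustrated with a small userspace sketch. It is not the kernel's code: every `DEMO_`/`demo_` name is hypothetical, and `DEMO_CONFIG_FAILSLAB` stands in for a real config option. The enum hands out bit numbers, one macro turns a bit number into a mask, and a flag whose option is disabled becomes 0 so that OR-ing it in has no effect.

```c
#include <assert.h>

/* Bit numbers are assigned by the compiler instead of by hand. */
enum demo_slab_flag_bits {
	DEMO_FLAGBIT_CONSISTENCY_CHECKS,
	DEMO_FLAGBIT_RED_ZONE,
	DEMO_FLAGBIT_POISON,
	DEMO_FLAGBIT_STORE_USER,
	DEMO_FLAGBIT_FAILSLAB,
	DEMO_FLAGS_LAST_BIT	/* one past the last used bit */
};

/* Hypothetical counterparts of __SLAB_FLAG_BIT() and __SLAB_FLAG_UNUSED. */
#define DEMO_FLAG_BIT(nr)	(1U << (nr))
#define DEMO_FLAG_UNUSED	0U

#define DEMO_CONSISTENCY_CHECKS	DEMO_FLAG_BIT(DEMO_FLAGBIT_CONSISTENCY_CHECKS)
#define DEMO_RED_ZONE		DEMO_FLAG_BIT(DEMO_FLAGBIT_RED_ZONE)
#define DEMO_POISON		DEMO_FLAG_BIT(DEMO_FLAGBIT_POISON)
#define DEMO_STORE_USER		DEMO_FLAG_BIT(DEMO_FLAGBIT_STORE_USER)

/* A flag whose config option is off collapses to 0 (assumed option name). */
#ifdef DEMO_CONFIG_FAILSLAB
#define DEMO_FAILSLAB		DEMO_FLAG_BIT(DEMO_FLAGBIT_FAILSLAB)
#else
#define DEMO_FAILSLAB		DEMO_FLAG_UNUSED
#endif

/* Adding a flag to the enum can never silently overflow the flag word. */
static_assert(DEMO_FLAGS_LAST_BIT <= 32, "flags must fit in 32 bits");

/* Combine several flags; a disabled flag contributes nothing. */
static inline unsigned int demo_debug_flags(void)
{
	return DEMO_RED_ZONE | DEMO_POISON | DEMO_STORE_USER | DEMO_FAILSLAB;
}
```

Because the numbering comes from the enum, inserting or removing a flag renumbers the rest automatically, which is only safe when the values are internal, as the commit message notes.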
#
c94d2224 |
|
20-Feb-2024 |
Chengming Zhou <chengming.zhou@linux.dev> |
mm, slab: fix the comment of cpu partial list The partial slabs on the cpu partial list are no longer frozen since commit 8cd3fa428b56 ("slub: Delay freezing of partial slabs") was merged. So fix the comment. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
303cd693 |
|
20-Feb-2024 |
Chengming Zhou <chengming.zhou@linux.dev> |
mm, slab: remove unused object_size parameter in kmem_cache_flags() We don't use the object_size parameter in kmem_cache_flags(), so just remove it. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
66b3dc1f |
|
29-Jan-2024 |
Zheng Yejian <zhengyejian1@huawei.com> |
mm/slub: remove parameter 'flags' in create_kmalloc_caches() After commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"), parameter 'flags' is only ever passed as 0 to create_kmalloc_caches(), which in turn only passes it to new_kmalloc_cache(). So we can turn 'flags' into a local variable with initial value 0 in new_kmalloc_cache() and remove the parameter from create_kmalloc_caches(). Also make new_kmalloc_cache() static because it is only used in mm/slab_common.c. Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
c63349fc |
|
23-Jan-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
mm/slub: remove unused parameter in next_freelist_entry() The parameter "struct slab *slab" is unused in next_freelist_entry(), so just remove it. Acked-by: Christoph Lameter (Ampere) <cl@linux.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
a6def11b |
|
23-Jan-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
mm/slub: remove full list manipulation for non-debug slab Since debug slab is processed by free_to_partial_list(), and only debug slab which has SLAB_STORE_USER flag would care about the full list, we can remove these unrelated full list manipulations from __slab_free(). Acked-by: Christoph Lameter (Ampere) <cl@linux.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
90b1e566 |
|
23-Jan-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
mm/slub: directly load freelist from cpu partial slab in the likely case The likely case is that we get a usable slab from the cpu partial list. We can then load the freelist from it directly and return, instead of taking the other path that needs more work, such as re-enabling interrupts and rechecking. But we need to remove the "VM_BUG_ON(!new.frozen)" in get_freelist() to reuse it, since a cpu partial slab is not frozen. This seems acceptable since the check is only for debugging purposes. get_freelist() also assumes it can return NULL if the freelist is empty, which is not possible for the cpu partial slab case, so we add "VM_BUG_ON(!freelist)" after get_freelist() to make that explicit. There is also a small performance improvement, as shown by:

  perf bench sched messaging -g 5 -t -l 100000

                mm-stable   slub-optimize
  Total time    7.473       7.209

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
671776b3 |
|
14-Dec-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
mm/slub: unify all sl[au]b parameters with "slab_$param" Since the SLAB allocator has been removed, we can clean up the sl[au]b_$params. With only one slab allocator left, it's better to use the generic "slab" term instead of "slub", which is an implementation detail, as pointed out by Vlastimil Babka. For more information please see [1]. Hence, we are going to use "slab_$param" as the primary prefix. This patch changes the following slab parameters
- slub_max_order
- slub_min_order
- slub_min_objects
- slub_debug
to
- slab_max_order
- slab_min_order
- slab_min_objects
- slab_debug
as the primary slab parameters for all references to them in docs and comments. But this patch won't change variables and functions inside slub, as we will have a wider slub/slab change. Meanwhile, "slub_$params" can still be passed on the command line, to keep backward compatibility. Also mark all "slub_$params" as legacy. Remove the separate descriptions for slub_[no]merge, and append a legacy tip for them at the end of the descriptions of slab_[no]merge.
[1] https://lore.kernel.org/linux-mm/7512b350-4317-21a0-fab3-4101bc4d8f7a@suse.cz/
Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
5e0a760b |
|
28-Dec-2023 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
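The inclusive meaning of the renamed constant is easy to get wrong, so here is a minimal userspace sketch of it. The value 10 and all `DEMO_`/`demo_` names are assumptions for illustration only; the point is that an inclusive maximum means loops use `<=` and arrays indexed by order need one extra slot.

```c
#include <assert.h>

/* Assumed value for illustration; the real one is configuration dependent. */
#define DEMO_MAX_PAGE_ORDER	10

/* With an inclusive maximum, an array indexed by order needs MAX + 1 slots. */
#define DEMO_NR_PAGE_ORDERS	(DEMO_MAX_PAGE_ORDER + 1)

static_assert(DEMO_NR_PAGE_ORDERS == 11, "orders 0..10 inclusive");

/* Count the valid orders by walking them; note the <= in the loop bound. */
static unsigned int demo_count_orders(void)
{
	unsigned int n = 0;

	for (unsigned int order = 0; order <= DEMO_MAX_PAGE_ORDER; order++)
		n++;
	return n;
}

/* Pages in the largest block: the maximum order itself is allocatable. */
static unsigned long demo_max_block_pages(void)
{
	return 1UL << DEMO_MAX_PAGE_ORDER;
}
```

Code written against the older exclusive definition would have looped with `<` and sized arrays without the extra slot, which is the kind of mismatch the rename is meant to surface.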
#
8014c46a |
|
28-Dec-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
slub: use alloc_pages_node() in alloc_slab_page() For no apparent reason, we were open-coding alloc_pages_node() in this function. Link: https://lkml.kernel.org/r/20231228085748.1083901-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
1ce9a052 |
|
19-Dec-2023 |
Andrey Konovalov <andreyknvl@gmail.com> |
kasan: rename and document kasan_(un)poison_object_data Rename kasan_unpoison_object_data to kasan_unpoison_new_object and add a documentation comment. Do the same for kasan_poison_object_data. The new names and the comments should suggest to users that these hooks are intended for internal use by the slab allocator. The following patch will remove non-slab-internal uses of these hooks. No functional changes. [andreyknvl@google.com: update references to renamed functions in comments] Link: https://lkml.kernel.org/r/20231221180637.105098-1-andrey.konovalov@linux.dev Link: https://lkml.kernel.org/r/eab156ebbd635f9635ef67d1a4271f716994e628.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
782f8906 |
|
14-Nov-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: free KFENCE objects in slab_free_hook() When freeing an object that was allocated from KFENCE, we do that in the slowpath __slab_free(), relying on the fact that KFENCE "slab" cannot be the cpu slab, so the fastpath has to fallback to the slowpath. This optimization doesn't help much though, because is_kfence_address() is checked earlier anyway during the free hook processing or detached freelist building. Thus we can simplify the code by making the slab_free_hook() free the KFENCE object immediately, similarly to KASAN quarantine. In slab_free_hook() we can place kfence_free() above init processing, as callers have been making sure to set init to false for KFENCE objects. This simplifies slab_free(). This places it also above kasan_slab_free() which is ok as that skips KFENCE objects anyway. While at it also determine the init value in slab_free_freelist_hook() outside of the loop. This change will also make introducing per cpu array caches easier. Tested-by: Marco Elver <elver@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
2d552463 |
|
22-Nov-2023 |
Andrey Konovalov <andreyknvl@gmail.com> |
slub, kasan: improve interaction of KASAN and slub_debug poisoning When both KASAN and slub_debug are enabled and a free object is being prepared in setup_object, slub_debug poisons the object data before KASAN initializes its per-object metadata. Right now, in setup_object, KASAN only initializes the alloc metadata, which is always stored outside of the object. slub_debug is aware of this and it skips poisoning and checking that memory area. However, with the following patch in this series, KASAN also starts initializing its free metadata in setup_object. As this metadata might be stored within the object, this initialization might overwrite the slub_debug poisoning. This leads to slub_debug reports. Thus, skip checking slub_debug poisoning of the object data area that overlaps with the in-object KASAN free metadata. Also make slub_debug poisoning of tail kmalloc redzones more precise when KASAN is enabled: slub_debug can still poison and check the tail kmalloc allocation area that comes after the KASAN free metadata. Link: https://lkml.kernel.org/r/20231122231202.121277-1-andrey.konovalov@linux.dev Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
284f17ac |
|
03-Nov-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: handle bulk and single object freeing separately Currently we have a single function slab_free() handling both single object freeing and bulk freeing with necessary hooks, the latter case requiring slab_free_freelist_hook(). It should be however better to distinguish the two use cases for the following reasons:
- code simpler to follow for the single object case
- better code generation - although inlining should eliminate the slab_free_freelist_hook() for single object freeing in case no debugging options are enabled, it seems it's not perfect. When e.g. KASAN is enabled, we're imposing additional unnecessary overhead for single object freeing.
- preparation to add percpu array caches in near future
Therefore, simplify slab_free() for the single object case by dropping unnecessary parameters and calling only slab_free_hook() instead of slab_free_freelist_hook(). Rename the bulk variant to slab_free_bulk() and adjust callers accordingly. While at it, flip (and document) slab_free_hook() return value so that it returns true when the freeing can proceed, which matches the logic of slab_free_freelist_hook() and is not confusingly the opposite. Additionally we can simplify a bit by changing the tail parameter of do_slab_free() when freeing a single object - instead of NULL we can set it equal to head. bloat-o-meter shows small code reduction with a .config that has KASAN etc disabled:

  add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-118 (-118)
  Function                  old     new   delta
  kmem_cache_alloc_bulk    1203    1196      -7
  kmem_cache_free           861     835     -26
  __kmem_cache_free         741     704     -37
  kmem_cache_free_bulk      911     863     -48

Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
520a688a |
|
02-Nov-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: introduce __kmem_cache_free_bulk() without free hooks Currently, when __kmem_cache_alloc_bulk() fails, it frees back the objects that were allocated before the failure, using kmem_cache_free_bulk(). Because kmem_cache_free_bulk() calls the free hooks (KASAN etc.) and those expect objects that were processed by the post alloc hooks, slab_post_alloc_hook() is called before kmem_cache_free_bulk(). This is wasteful, although not a big concern in practice for the rare error path. But in order to efficiently handle percpu array batch refill and free in the near future, we will also need a variant of kmem_cache_free_bulk() that avoids the free hooks. So introduce it now and use it for the failure path. In case of failure we however still need to perform memcg uncharge so handle that in a new memcg_slab_alloc_error_hook(). Thanks to Chengming Zhou for noticing the missing uncharge. As a consequence, __kmem_cache_alloc_bulk() no longer needs the objcg parameter, remove it. Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
6f3dd2c3 |
|
07-Aug-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: fix bulk alloc and free stats The SLUB sysfs stats enabled by CONFIG_SLUB_STATS have two deficiencies identified wrt bulk alloc/free operations:
- Bulk allocations from cpu freelist are not counted. Add the ALLOC_FASTPATH counter there.
- Bulk fastpath freeing will count a list of multiple objects with a single FREE_FASTPATH inc. Add a stat_add() variant to count them all.
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
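The second deficiency in this commit is a counting problem, which a small sketch can make concrete. This is not the kernel's implementation: the struct, the enum, and all `demo_` names are hypothetical, and real counters are per-cpu. It only shows why a bulk path needs an "add n" variant next to the usual "increment by one" helper.

```c
#include <assert.h>

/* Hypothetical, simplified counter set; real SLUB stats are per-cpu. */
enum demo_stat_item {
	DEMO_ALLOC_FASTPATH,
	DEMO_FREE_FASTPATH,
	DEMO_NR_STATS
};

struct demo_cache_stats {
	unsigned long stat[DEMO_NR_STATS];
};

/* Count a single event. */
static inline void demo_stat(struct demo_cache_stats *s,
			     enum demo_stat_item si)
{
	s->stat[si]++;
}

/* Count v events at once, as a stat_add()-style variant would. */
static inline void demo_stat_add(struct demo_cache_stats *s,
				 enum demo_stat_item si, unsigned long v)
{
	s->stat[si] += v;
}

/*
 * Bulk fastpath free of cnt objects. Calling demo_stat() here would
 * record one free no matter how many objects the list carries.
 */
static void demo_bulk_free_fastpath(struct demo_cache_stats *s,
				    unsigned long cnt)
{
	assert(cnt > 0);
	/* ... the list of cnt objects would be handed back here ... */
	demo_stat_add(s, DEMO_FREE_FASTPATH, cnt);
}
```

With this shape, freeing a list of five objects through the bulk fastpath raises the free counter by five instead of one.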
#
ecf9a253 |
|
26-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: optimize free fast path code layout Inspection of kmem_cache_free() disassembly showed we could make the fast path smaller by providing few more hints to the compiler, and splitting the memcg_slab_free_hook() into an inline part that only checks if there's work to do, and an out of line part doing the actual uncharge. bloat-o-meter results:

  add/remove: 2/0 grow/shrink: 0/3 up/down: 286/-554 (-268)
  Function                          old     new   delta
  __memcg_slab_free_hook              -     270    +270
  __pfx___memcg_slab_free_hook        -      16     +16
  kfree                             828     665    -163
  kmem_cache_free                  1116     948    -168
  kmem_cache_free_bulk.part        1701    1478    -223

Checking kmem_cache_free() disassembly now shows the non-fastpath cases are handled out of line, which should reduce instruction cache usage. Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
3450a0e5 |
|
13-Nov-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: optimize alloc fastpath code layout With allocation fastpaths no longer divided between two .c files, we have better inlining, however checking the disassembly of kmem_cache_alloc() reveals we can do better to make the fastpaths smaller and move the less common situations out of line or to separate functions, to reduce instruction cache pressure.
- split memcg pre/post alloc hooks to inlined checks that use likely() to assume there will be no objcg handling necessary, and non-inline functions doing the actual handling
- add some more likely/unlikely() to pre/post alloc hooks to indicate which scenarios should be out of line
- change gfp_allowed_mask handling in slab_post_alloc_hook() so the code can be optimized away when kasan/kmsan/kmemleak is configured out
bloat-o-meter shows:

  add/remove: 4/2 grow/shrink: 1/8 up/down: 521/-2924 (-2403)
  Function                                 old     new   delta
  __memcg_slab_post_alloc_hook               -     461    +461
  kmem_cache_alloc_bulk                    775     791     +16
  __pfx_should_failslab.constprop            -      16     +16
  __pfx___memcg_slab_post_alloc_hook         -      16     +16
  should_failslab.constprop                  -      12     +12
  __pfx_memcg_slab_post_alloc_hook          16       -     -16
  kmem_cache_alloc_lru                    1295    1023    -272
  kmem_cache_alloc_node                   1118     817    -301
  kmem_cache_alloc                        1076     772    -304
  kmalloc_node_trace                      1149     838    -311
  kmalloc_trace                           1102     789    -313
  __kmalloc_node_track_caller             1393    1080    -313
  __kmalloc_node                          1397    1082    -315
  __kmalloc                               1374    1059    -315
  memcg_slab_post_alloc_hook               464       -    -464

Note that gcc still decided to inline __memcg_pre_alloc_hook(), but the code is out of line. Forcing noinline did not improve the results. As a result the fastpaths are shorter and overall code size is reduced. Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
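The first bullet of this commit describes a general layout pattern: keep a tiny inlined check on the hot path and push the rarely needed work into a separate non-inlined function. The sketch below shows that pattern in userspace C under stated assumptions: it relies on the GCC/Clang `__builtin_expect` builtin and `noinline` attribute, and every `demo_` name, including the `accounted` parameter that stands in for "objcg handling is needed", is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's likely() hint (GCC/Clang builtin). */
#define demo_likely(x)	__builtin_expect(!!(x), 1)

/* Only here so the sketch can show how often the slow part runs. */
static unsigned long demo_slow_calls;

/*
 * Out-of-line part: does the actual (rare) work. noinline keeps its
 * body out of every caller's fast path.
 */
static __attribute__((noinline)) bool demo_post_alloc_slow(void *obj)
{
	assert(obj != NULL);
	demo_slow_calls++;
	/* ... accounting work for the object would happen here ... */
	return true;
}

/*
 * Inlined part: one predictable branch. In the common case nothing
 * needs handling and the function returns without a call.
 */
static inline bool demo_post_alloc_hook(bool accounted, void *obj)
{
	if (demo_likely(!accounted))
		return true;
	return demo_post_alloc_slow(obj);
}
```

Whether the compiler honors such hints is ultimately its decision, as the commit's own note about gcc inlining one of the hooks shows, so results are best checked with a tool like bloat-o-meter or by reading the disassembly.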
#
49378a05 |
|
26-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: remove slab_alloc() and __kmem_cache_alloc_lru() wrappers slab_alloc() is a thin wrapper around slab_alloc_node() with only one caller. Replace it with a direct call of slab_alloc_node(). __kmem_cache_alloc_lru() itself is a thin wrapper with two callers, so replace it with direct calls of slab_alloc_node() and trace_kmem_cache_alloc(). This also makes sure _RET_IP_ always has the expected value and does not depend on inlining decisions. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
4862caa5 |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move kmalloc() functions from slab_common.c to slub.c This will eliminate a call between compilation units through __kmem_cache_alloc_node() and allow better inlining of the allocation fast path. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
b774d3e3 |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move kfree() from slab_common.c to slub.c This should result in better code. Currently kfree() makes a function call between compilation units to __kmem_cache_free() which does its own virt_to_slab(), throwing away the struct slab pointer we already had in kfree(). Now it can be reused. Additionally kfree() can now inline the whole SLUB freeing fastpath. Also move over free_large_kmalloc() as the only callsites are now in slub.c, and make it static. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
b52ef56e |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move struct kmem_cache_node from slab.h to slub.c The declaration and associated helpers are not used anywhere else anymore. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
0bedcc66 |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move memcg related functions from slab.h to slub.c We don't share those between SLAB and SLUB anymore, so most memcg related functions can be moved to slub.c proper. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
6011be59 |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move pre/post-alloc hooks from slab.h to slub.c We don't share the hooks between two slab implementations anymore so they can be moved away from the header. As part of the move, also move should_failslab() from slab_common.c as the pre_alloc hook uses it. This means slab.h can stop including fault-inject.h and kmemleak.h. Fix up some files that were depending on the includes transitively. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
7ef08ae8 |
|
03-Oct-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab: move struct kmem_cache_cpu declaration to slub.c Nothing outside SLUB itself accesses the struct kmem_cache_cpu fields so it does not need to be declared in slub_def.h. This also allows moving enum stat_item. Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
0445ee00 |
|
20-Nov-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slab, docs: switch mm-api docs generation from slab.c to slub.c The SLAB implementation is going to be removed, and mm-api.rst currently uses mm/slab.c to obtain kerneldocs for some API functions. Switch it to mm/slub.c and move the relevant kerneldocs of exported functions from one to the other. The rest of kerneldocs in slab.c is for static SLAB implementation-specific functions that don't have counterparts in slub.c and thus can be simply removed with the implementation. Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
31bda717 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Update frozen slabs documentations in the source The current updated scheme (which this series implemented) is:
- node partial slabs: PG_Workingset && !frozen
- cpu partial slabs: !PG_Workingset && !frozen
- cpu slabs: !PG_Workingset && frozen
- full slabs: !PG_Workingset && !frozen
The most important change is that "frozen" bit is not set for the cpu partial slabs anymore, __slab_free() will grab node list_lock then check by !PG_Workingset that it's not on a node partial list. And the "frozen" bit is still kept for the cpu slabs for performance, since we don't need to grab node list_lock to check whether the PG_Workingset is set or not if the "frozen" bit is set in __slab_free(). Update related documentations and comments in the source. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Christoph Lameter (Ampere) <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
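The state scheme in this commit boils down to a two-step decision on the free path: check the frozen bit first, and only consult the "on node partial list" bit when the slab is not frozen. The following userspace sketch models that decision only. It has no locking, the `on_node_partial` field stands in for the PG_workingset page flag, and all `demo_`/`DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Two bits of state, modeled as plain booleans for illustration. */
struct demo_slab {
	bool frozen;		/* set only for cpu slabs */
	bool on_node_partial;	/* stands in for PG_workingset */
};

enum demo_free_action {
	DEMO_FREE_CPU_SLAB,		/* frozen: no list_lock needed */
	DEMO_FREE_ON_NODE_PARTIAL,	/* list may need manipulation */
	DEMO_FREE_NOT_ON_NODE_LIST	/* cpu partial slab or full slab */
};

static enum demo_free_action
demo_classify_for_free(const struct demo_slab *slab)
{
	/* The scheme never combines frozen with node partial membership. */
	assert(!(slab->frozen && slab->on_node_partial));

	/* Cheap check first: a frozen slab is some cpu's active slab. */
	if (slab->frozen)
		return DEMO_FREE_CPU_SLAB;

	/* In the kernel this test is made with the node list_lock held. */
	if (slab->on_node_partial)
		return DEMO_FREE_ON_NODE_PARTIAL;

	/* The two bits alone cannot tell cpu partial and full slabs apart. */
	return DEMO_FREE_NOT_ON_NODE_LIST;
}
```

The ordering mirrors the performance argument in the message: keeping the frozen bit for cpu slabs lets the most common case be recognized without taking the lock at all.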
#
21316fdc |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Rename all *unfreeze_partials* functions to *put_partials* Since all partial slabs on the CPU partial list are not frozen anymore, we don't unfreeze when moving cpu partial slabs to node partial list, it's better to rename these functions. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
00eb60c2 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Optimize deactivate_slab() Since the introduction of unfrozen slabs on the cpu partial list, we don't need to synchronize the slab frozen state under the node list_lock. The caller of deactivate_slab() and the caller of __slab_free() won't manipulate the slab list concurrently. So we can take the node list_lock in the last stage, and only if we really need to manipulate the slab list in this path. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
8cd3fa42 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Delay freezing of partial slabs Currently we freeze slabs when moving them from the node partial list to the cpu partial list. This method needs two cmpxchg_double operations:
1. freeze slab (acquire_slab()) under the node list_lock
2. get_freelist() when the slab is picked for use in ___slab_alloc()
Actually we don't need to freeze when moving slabs out of the node partial list. We can delay freezing until the slab freelist is used in ___slab_alloc(), which saves one cmpxchg_double(). There are other benefits as well:
- Moving slabs between the node partial list and the cpu partial list becomes simpler, since we don't need to freeze or unfreeze at all.
- There is less contention on the node list_lock, since we don't need to freeze any slab under it.
We can achieve this because no concurrent path manipulates the partial slab list except the __slab_free() path, which is now serialized by slab_test_node_partial() under the list_lock. Since the slab returned by the get_partial() interfaces is not frozen anymore and no freelist is returned in the partial_context, we need to use the newly introduced freeze_slab() to freeze it and get its freelist. Similarly, the slabs on the CPU partial list are not frozen anymore, so we need to call freeze_slab() on them before use. We can now delete acquire_slab() as it became unused. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
213094b5 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Introduce freeze_slab() We will have unfrozen slabs out of the node partial list later, so we need a freeze_slab() function to freeze the partial slab and get its freelist. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
422e7d54 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Prepare __slab_free() for unfrozen partial slab out of node partial list Currently a partially empty slab is frozen when taken out of the node partial list, so __slab_free() knows from "was_frozen" that the partially empty slab is not on the node partial list and is a cpu or cpu partial slab of some cpu. But we will change this and make partial slabs leave the node partial list in an unfrozen state, so we need to change __slab_free() to use the new slab_test_node_partial() we just introduced. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
8a399e2f |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Keep track of whether slub is on the per-node partial list Now we rely on the "frozen" bit to see if we should manipulate the slab->slab_list, which will be changed in the following patch. Instead we introduce another way to keep track of whether slub is on the per-node partial list, here we reuse the PG_workingset bit. We have to use the atomic set_bit() and clear_bit() variants and change slab_unlock() to bit_spin_unlock() because when cmpxchg is not available and PG_lock is used, there may be concurrent operations on the two bits. Thanks to Mark Brown for reporting a hang and testing of a previous version where the non-atomic operations were used. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
43c4c349 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Change get_partial() interfaces to return slab We need all get_partial() related interfaces to return a slab, instead of returning the freelist (or object). Use the partial_context.object to return back freelist or object for now. This patch shouldn't have any functional changes. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
24c6a097 |
|
01-Nov-2023 |
Chengming Zhou <zhouchengming@bytedance.com> |
slub: Reflow ___slab_alloc() The get_partial() interface used in ___slab_alloc() may return a single object in the "kmem_cache_debug(s)" case, in which we will just return the "freelist" object. Move this handling up to prepare for later changes. And the "pfmemalloc_match()" part is not needed for node partial slab, since we already check this in the get_partial_node(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
90f055df |
|
07-Sep-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: refactor calculate_order() and calc_slab_order() After the previous cleanups, we can now move some code from calc_slab_order() to calculate_order() so it's executed just once, and do some more cleanups. - move the min_order and MAX_OBJS_PER_PAGE evaluation to calculate_order(). - change calc_slab_order() parameter min_objects to min_order Also make MAX_OBJS_PER_PAGE check more robust by considering also min_objects in addition to slub_min_order. Otherwise this is not a functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
|
#
5886fc82 |
|
08-Sep-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: attempt to find layouts up to 1/2 waste in calculate_order() The main loop in calculate_order() currently tries to find an order with at most 1/4 waste. If that's impossible (for particular large object sizes), there's a fallback that will try to place one object within slab_max_order. If we expand the loop boundary to also allow up to 1/2 waste as the last resort, we can remove the fallback and simplify the code, as the loop will find an order for such sizes as well. Note we don't need to allow more than 1/2 waste as that will never happen - calc_slab_order() would calculate more objects to fit, reducing waste below 1/2. Successfully finding an order in the loop (compared to the fallback) will also have the benefit in trying to satisfy min_objects, because the fallback was passing 1. Thus the resulting slab orders might be larger (not because it would improve waste, but to reduce pressure on shared locks), which is one of the goals of calculate_order(). For example, with nr_cpus=1 and 4kB PAGE_SIZE, slub_max_order=3, before the patch we would get the following orders for these object sizes: 2056 to 10920 - order-3 as selected by the loop 10928 to 12280 - order-2 due to fallback, as <1/4 waste is not possible 12288 to 32768 - order-3 as <1/4 waste is again possible After the patch: 2056 to 32768 - order-3, because even in the range of 10928 to 12280 we try to satisfy the calculated min_objects. As a result the code is simpler and gives more consistent results. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
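
The before/after order ranges quoted above can be reproduced with a simplified user-space model. This is a sketch under stated assumptions, not the kernel code: the model_* names are hypothetical, PAGE_SIZE is fixed at 4 kB, and min_objects is passed in by the caller (8 is assumed here for the nr_cpus=1 case).

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096u   /* 4 kB pages, as in the commit's example */

/* Smallest order whose slab (PAGE_SIZE << order) can hold `size` bytes. */
static unsigned int model_get_order(unsigned int size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((PAGE_SIZE_BYTES << order) < size)
		order++;
	return order;
}

/*
 * First order in [min_order, max_order] that wastes at most 1/fract of the
 * slab; returns a value above max_order when no order qualifies.
 */
static unsigned int model_calc_slab_order(unsigned int size,
					  unsigned int min_order,
					  unsigned int max_order,
					  unsigned int fract)
{
	unsigned int order;

	for (order = min_order; order <= max_order; order++) {
		unsigned int slab_size = PAGE_SIZE_BYTES << order;

		if (slab_size % size <= slab_size / fract)
			break;
	}
	return order;
}

/* Cap min_objects by what fits into max_order, then derive the start order. */
static unsigned int model_min_order(unsigned int size,
				    unsigned int min_objects,
				    unsigned int max_order)
{
	unsigned int max_objects = (PAGE_SIZE_BYTES << max_order) / size;

	if (max_objects < 1)
		max_objects = 1;
	if (min_objects > max_objects)
		min_objects = max_objects;
	return model_get_order(min_objects * size);
}

/* Before: accept 1/16..1/4 waste, then fall back to a single object. */
static unsigned int model_order_before(unsigned int size,
				       unsigned int min_objects,
				       unsigned int max_order)
{
	unsigned int min_order = model_min_order(size, min_objects, max_order);

	for (unsigned int fract = 16; fract >= 4; fract /= 2) {
		unsigned int order = model_calc_slab_order(size, min_order,
							   max_order, fract);
		if (order <= max_order)
			return order;
	}
	/* Fallback: min_objects = 1 and any amount of waste is accepted. */
	return model_calc_slab_order(size, model_get_order(size), max_order, 1);
}

/* After: the loop also accepts 1/2 waste, so no separate fallback is needed. */
static unsigned int model_order_after(unsigned int size,
				      unsigned int min_objects,
				      unsigned int max_order)
{
	unsigned int min_order = model_min_order(size, min_objects, max_order);

	for (unsigned int fract = 16; fract > 1; fract /= 2) {
		unsigned int order = model_calc_slab_order(size, min_order,
							   max_order, fract);
		if (order <= max_order)
			return order;
	}
	return model_get_order(size);	/* last resort for oversized objects */
}
```

With min_objects = 8 and max_order = 3, the model gives order 2 for sizes 10928 and 12280 in the "before" variant and order 3 in the "after" variant, matching the ranges listed in the message.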
|
#
0fe2735d |
|
08-Sep-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: remove min_objects loop from calculate_order() calculate_order() currently has two nested loops. The inner one that gradually modifies the acceptable waste from 1/16 up to 1/4, and the outer one that decreases min_objects down to 2. Upon closer inspection, the outer loop is unnecessary. Decreasing min_objects could have in theory two effects to make the inner loop and its call to calc_slab_order() succeed where a previous iteration with higher min_objects would not: - it could cause the min_objects-derived min_order to fit within slub_max_order. But min_objects is already pre-capped to max_objects that's derived from slub_max_order above the loops, so every iteration tries at least slub_max_order in calc_slab_order() - it could cause calc_slab_order() to be called with lower min_objects thus potentially lower min_order in its loop. This would make a difference if the lower order could cause the fractional waste test to succeed where a higher order has already failed with same fract_leftover in the previous iteration with a higher min_order. But that's not possible, because increasing the order can only result in lower (or same) fractional waste. If we increase the slab size 2 times, we will fit at least 2 times the number of objects (thus same fraction of waste), or it will allow us to fit one more object (lower fraction of waste). For more confidence I have tried adding a printk to notify when decreasing min_objects resulted in a success, and simulated calculations for a range of object sizes, nr_cpus and page_sizes. As expected, the printk never triggered. Thus remove the outer loop and adjust comments accordingly. There's almost no functional change except a weird corner case when slub_min_objects=1 on boot command line would cause the whole two nested loops to be skipped before this patch. Now it would try to find the best layout as usual, resulting in potentially higher order that minimizes waste. This is not wrong and will be further expanded by the next patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
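
The claim that doubling the slab size can only lower (or keep) the fractional waste can be checked by brute force. The following is a hedged user-space sketch; the function name and parameter ranges are illustrative assumptions and do not come from the patch.

```c
#include <stdbool.h>

/*
 * Returns true if, for every object size in [1, max_size] and every order in
 * [0, max_order), doubling the slab never increases the wasted fraction.
 */
static bool doubling_never_increases_waste(unsigned int page_size,
					    unsigned int max_order,
					    unsigned int max_size)
{
	for (unsigned int size = 1; size <= max_size; size++) {
		for (unsigned int order = 0; order < max_order; order++) {
			unsigned long slab = (unsigned long)page_size << order;
			unsigned long rem = slab % size;
			unsigned long rem2 = (2 * slab) % size;

			/* rem2 / (2 * slab) <= rem / slab  <=>  rem2 <= 2 * rem */
			if (rem2 > 2 * rem)
				return false;
		}
	}
	return true;
}
```

The check holds because the leftover of the doubled slab is (2 * rem) mod size, which can never exceed 2 * rem.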
|
#
c7355d75 |
|
08-Sep-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: simplify the last resort slab order calculation If calculate_order() can't fit even a single large object within slub_max_order, it will try using the smallest necessary order that may exceed slub_max_order but not MAX_ORDER. Currently this is done with a call to calc_slab_order() which is unnecessary. We can simply use get_order(size). No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
|
#
e519ce7a |
|
20-Sep-2023 |
Feng Tang <feng.tang@intel.com> |
mm/slub: add sanity check for slub_min/max_order cmdline setup Currently there are 2 parameters that can be set up from the kernel cmdline: slub_min_order and slub_max_order. It's possible that the user-configured slub_min_order is bigger than the default slub_max_order [1], which can still take effect, as calculate_order() will use MAX_ORDER as a fallback to check against, but this has some downsides: * the kernel message about SLUB will be strange in showing min/max orders: SLUB: HWalign=64, Order=9-3, MinObjects=0, CPUs=16, Nodes=1 * in calculate_order() called by each slab, the 2 loops of calc_slab_order() will all be meaningless because slub_min_order is bigger than slub_max_order * it prevents future code cleanup like in [2]. Fix it by adding a sanity check to enforce the min/max semantics. [1]. https://lore.kernel.org/lkml/21a0ba8b-bf05-0799-7c78-2a35f8c8d52a@os.amperecomputing.com/ [2]. https://lore.kernel.org/lkml/20230908145302.30320-7-vbabka@suse.cz/ Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
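
One way such a sanity check could look is sketched below in user-space C. The message does not say which of the two values gets adjusted, so the policy here (the value set last wins and drags the other along) is an assumption, as are the struct, the model_* names and the MODEL_MAX_ORDER limit.

```c
/* Hypothetical stand-ins for the two boot-time tunables. */
struct order_limits {
	unsigned int min_order;
	unsigned int max_order;
};

#define MODEL_MAX_ORDER 10u	/* assumed page allocator limit, illustrative */

static unsigned int model_clamp_order(unsigned int value)
{
	return value > MODEL_MAX_ORDER ? MODEL_MAX_ORDER : value;
}

/* Setting the maximum pulls the minimum down if needed (assumed policy). */
static void model_set_max_order(struct order_limits *l, unsigned int value)
{
	l->max_order = model_clamp_order(value);
	if (l->min_order > l->max_order)
		l->min_order = l->max_order;
}

/* Setting the minimum pushes the maximum up if needed (assumed policy). */
static void model_set_min_order(struct order_limits *l, unsigned int value)
{
	l->min_order = model_clamp_order(value);
	if (l->min_order > l->max_order)
		l->max_order = l->min_order;
}
```

Either setter leaves min_order <= max_order, so an inverted range such as the "Order=9-3" shown in the message cannot be produced by this model.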
|
#
1662b6c2 |
|
11-Jul-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: remove freelist_dereference() freelist_dereference() is a one-liner only used from get_freepointer(). Remove it and make get_freepointer() call freelist_ptr_decode() directly to make the code easier to follow. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kees Cook <keescook@chromium.org>
|
#
b06952cd |
|
10-Jul-2023 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: remove redundant kasan_reset_tag() from freelist_ptr calculations Commit d36a63a943e3 ("kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED") has introduced kasan_reset_tags() to freelist_ptr() encoding/decoding when CONFIG_SLAB_FREELIST_HARDENED is enabled to resolve issues when passing tagged or untagged pointers inconsistently would lead to incorrect calculations. Later, commit aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") made sure all pointers have tags reset regardless of CONFIG_SLAB_FREELIST_HARDENED, because there was no other way to access the freepointer metadata safely with hw tag-based KASAN. Therefore the kasan_reset_tag() usage in freelist_ptr_encode()/decode() is now redundant, as all callers use kasan_reset_tag() unconditionally when constructing ptr_addr. Remove the redundant calls and simplify the code and remove obsolete comments. Also in freelist_ptr_encode() introduce an 'encoded' variable to make the lines shorter and make it similar to the _decode() one. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Kees Cook <keescook@chromium.org>
|
#
44f6a42d |
|
04-Jul-2023 |
Jann Horn <jannh@google.com> |
mm/slub: refactor freelist to use custom type Currently the SLUB code represents encoded freelist entries as "void*". That's misleading, those things are encoded under CONFIG_SLAB_FREELIST_HARDENED so that they're not actually dereferencable. Give them their own type, and split freelist_ptr() into one function per direction (one for encoding, one for decoding). Signed-off-by: Jann Horn <jannh@google.com> Co-developed-by: Matteo Rizzo <matteorizzo@google.com> Signed-off-by: Matteo Rizzo <matteorizzo@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
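
The idea of a dedicated type with one function per direction can be sketched in user-space C as follows. The type and function names are hypothetical, and the XOR of pointer, per-cache secret and byte-swapped slot address is an assumption about the hardening scheme, used only to make the round trip concrete.

```c
#include <stdint.h>

/* Distinct wrapper type: an encoded entry cannot be dereferenced by accident. */
typedef struct { uintptr_t v; } encoded_freeptr_t;

/* Portable byte swap, so the sketch does not rely on compiler builtins. */
static uintptr_t swap_bytes(uintptr_t x)
{
	uintptr_t out = 0;

	for (unsigned int i = 0; i < sizeof(x); i++) {
		out = (out << 8) | (x & 0xff);
		x >>= 8;
	}
	return out;
}

/* Encoding direction: plain pointer in, opaque value out. */
static encoded_freeptr_t model_freeptr_encode(void *ptr, uintptr_t secret,
					      uintptr_t slot_addr)
{
	encoded_freeptr_t enc = {
		.v = (uintptr_t)ptr ^ secret ^ swap_bytes(slot_addr)
	};
	return enc;
}

/* Decoding direction: opaque value in, plain pointer out. */
static void *model_freeptr_decode(encoded_freeptr_t enc, uintptr_t secret,
				  uintptr_t slot_addr)
{
	return (void *)(enc.v ^ secret ^ swap_bytes(slot_addr));
}
```

Because the encoded value is a struct rather than a void pointer, passing it where a real pointer is expected becomes a compile error instead of a silent bug.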
|
#
6801be4f |
|
31-May-2023 |
Peter Zijlstra <peterz@infradead.org> |
slub: Replace cmpxchg_double() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.924677086@infradead.org
|
#
8040cbf5 |
|
13-Apr-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
slub: Don't read nr_slabs and total_objects directly We have node_nr_slabs() to read nr_slabs, node_nr_objs() to read total_objects in a kmem_cache_node, so no need to access the two members directly. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
4f174a8b |
|
13-Apr-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
slub: Remove slabs_node() function When traversing nodes one by one, get_node() is already called in the for_each_kmem_cache_node macro, so there is no need to call get_node() again in slabs_node(); just reading the nr_slabs field should be enough, and the node_nr_slabs() function can already do this. Hence, the slabs_node() function is not needed anymore. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
c6c17c4d |
|
13-Apr-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
slub: Remove CONFIG_SMP defined check As CONFIG_SMP is one of the dependencies of CONFIG_SLUB_CPU_PARTIAL, if CONFIG_SLUB_CPU_PARTIAL is defined then CONFIG_SMP must be defined, so there is no need to check the CONFIG_SMP definition here. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
81bd3179 |
|
13-Apr-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
slub: Put objects_show() into CONFIG_SLUB_DEBUG enabled block The SO_ALL|SO_OBJECTS pair is only used when enabling CONFIG_SLUB_DEBUG option, so the objects_show() definition should be surrounded by CONFIG_SLUB_DEBUG too. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
35973232 |
|
13-Apr-2023 |
Xiongwei Song <xiongwei.song@windriver.com> |
slub: Correct the error code when slab_kset is NULL The -ENOSYS error code is improper when the kset_create_and_add() call returns a NULL pointer; the failure is more likely due to a lack of memory, hence returning -ENOMEM is better. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
c7b23b68 |
|
13-Apr-2023 |
Yosry Ahmed <yosryahmed@google.com> |
mm: vmscan: refactor updating current->reclaim_state During reclaim, we keep track of pages reclaimed from other means than LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab, which we stash a pointer to in current task_struct. However, we keep track of more than just reclaimed slab pages through this. We also use it for clean file pages dropped through pruned inodes, and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a helper function that wraps updating it through current, so that future changes to this logic are contained within include/linux/swap.h. Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
23baf831 |
|
15-Mar-2023 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, treewide: redefine MAX_ORDER sanely MAX_ORDER is currently defined as the number of orders the page allocator supports: a user can ask the buddy allocator for a page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and has led to a number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders a user can ask from the buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
7a16d7c7 |
|
15-Mar-2023 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm/slub: fix MAX_ORDER usage in calculate_order() MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in calculate_order(). Link: https://lkml.kernel.org/r/20230315113133.11326-9-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
9ebe720e |
|
20-Feb-2023 |
Thomas Weißschuh <linux@weissschuh.net> |
mm: slub: make kobj_type structure constant Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
1c0310ad |
|
10-Feb-2023 |
Andrey Konovalov <andreyknvl@gmail.com> |
lib/stackdepot, mm: rename stack_depot_want_early_init Rename stack_depot_want_early_init to stack_depot_request_early_init. The old name is confusing, as it hints at returning some kind of intention of stack depot. The new name reflects that this function requests an action from stack depot instead. No functional changes. [akpm@linux-foundation.org: update mm/kmemleak.c] Link: https://lkml.kernel.org/r/359f31bf67429a06e630b4395816a967214ef753.1676063693.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
f5451547 |
|
07-Feb-2023 |
Thomas Gleixner <tglx@linutronix.de> |
mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early The memory allocators are available during early boot even in the phase where interrupts are disabled and scheduling is not yet possible. The setup is so that GFP_KERNEL allocations work in this phase without causing might_alloc() splats to be emitted because the system state is SYSTEM_BOOTING at that point which prevents the warnings to trigger. Most allocation/free functions use local_irq_save()/restore() or a lock variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when interrupts are enabled during the early boot phase. This went unnoticed so far as there are no early users of these interfaces. The upcoming conversion of the interrupt descriptor store from radix_tree to maple_tree triggered this warning as maple_tree uses the bulk interface. Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and SLAB to local[_lock]_irq_save()/restore(). There is obviously no reclaim possible and required at this point so there is no need to expand this coverage further. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
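
The difference between the two IRQ-handling styles can be shown with a small user-space simulation. Everything here is hypothetical: a plain boolean stands in for the CPU's interrupt-enable state, and the sim_* and bulk_op_* names are illustrative only.

```c
#include <stdbool.h>

static bool irqs_enabled;	/* simulated CPU interrupt-enable flag */

static void sim_irq_disable(void) { irqs_enabled = false; }
static void sim_irq_enable(void)  { irqs_enabled = true; }

/* Remember the previous state, then disable. */
static bool sim_irq_save(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Put back whatever state the caller had on entry. */
static void sim_irq_restore(bool flags) { irqs_enabled = flags; }

/* disable/enable style: always leaves interrupts switched on at exit. */
static void bulk_op_disable_enable(void)
{
	sim_irq_disable();
	/* ... critical section ... */
	sim_irq_enable();
}

/* save/restore style: leaves interrupts exactly as they were on entry. */
static void bulk_op_save_restore(void)
{
	bool flags = sim_irq_save();

	/* ... critical section ... */
	sim_irq_restore(flags);
}
```

Called with interrupts already off, as in early boot, the first variant turns them on behind the caller's back, while the second leaves them off.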
|
#
aa4a8605 |
|
02-Feb-2023 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
mm/slub: fix memory leak with using debugfs_lookup() When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
02d65d6f |
|
06-Jan-2023 |
Sidhartha Kumar <sidhartha.kumar@oracle.com> |
mm: introduce folio_is_pfmemalloc Add a folio equivalent for page_is_pfmemalloc. This removes two instances of page_is_pfmemalloc(folio_page(folio, 0)) so the folio can be used directly. Link: https://lkml.kernel.org/r/20230106215251.599222-1-sidhartha.kumar@oracle.com Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
c034c6a4 |
|
09-Jan-2023 |
SeongJae Park <sj@kernel.org> |
mm/sl{a,u}b: fix wrong usages of folio_page() for getting head pages The standard idiom for getting head page of a given folio is '&folio->page', but some are wrongly using 'folio_page(folio, 0)' for the purpose. Fix those to use the idiom. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
909c6475 |
|
25-Nov-2022 |
David Gow <davidgow@google.com> |
mm: slub: test: Use the kunit_get_current_test() function Use the newly-added function kunit_get_current_test() instead of accessing current->kunit_test directly. This function uses a static key to return more quickly when KUnit is enabled, but no tests are actively running. There should therefore be a negligible performance impact to enabling the slub KUnit tests. Other than the performance improvement, this should be a no-op. Cc: Oliver Glitta <glittao@gmail.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Gow <davidgow@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
#
be784ba8 |
|
21-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: don't aggressively inline with CONFIG_SLUB_TINY SLUB fastpaths use __always_inline to avoid function calls. With CONFIG_SLUB_TINY we would rather save the memory. Add a __fastpath_inline macro that's __always_inline normally but empty with CONFIG_SLUB_TINY. bloat-o-meter results on x86_64 mm/slub.o: add/remove: 3/1 grow/shrink: 1/8 up/down: 865/-1784 (-919) Function old new delta kmem_cache_free 20 281 +261 slab_alloc_node.isra - 245 +245 slab_free.constprop.isra - 231 +231 __kmem_cache_alloc_lru.isra - 128 +128 __kmem_cache_release 88 83 -5 __kmem_cache_create 1446 1436 -10 __kmem_cache_free 271 142 -129 kmem_cache_alloc_node 330 127 -203 kmem_cache_free_bulk.part 826 613 -213 __kmem_cache_alloc_node 230 10 -220 kmem_cache_alloc_lru 325 12 -313 kmem_cache_alloc 325 10 -315 kmem_cache_free.part 376 - -376 Total: Before=26103, After=25184, chg -3.52% Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
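
A macro of this kind might be defined roughly as follows. This is a sketch for GCC or Clang in user space: the fallback definition of __always_inline and the two example functions are assumptions added so that the fragment compiles on its own.

```c
/* User-space stand-in; the kernel provides its own __always_inline. */
#ifndef __always_inline
#define __always_inline inline __attribute__((always_inline))
#endif

/* Force inlining normally, leave the decision to the compiler when tiny. */
#ifndef CONFIG_SLUB_TINY
#define __fastpath_inline __always_inline
#else
#define __fastpath_inline
#endif

/* Hypothetical fast-path helper. */
static __fastpath_inline int fastpath_add(int a, int b)
{
	return a + b;
}

/* Hypothetical caller; with CONFIG_SLUB_TINY this may become a real call. */
int slowpath_caller(int x)
{
	return fastpath_add(x, 1);
}
```

Building the same file with -DCONFIG_SLUB_TINY changes only the inlining decision, not the behaviour, which is why the commit can report pure size deltas.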
|
#
0af8489b |
|
15-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: remove percpu slabs with CONFIG_SLUB_TINY SLUB gets most of its scalability by percpu slabs. However for CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability. Thus, #ifdef out the whole kmem_cache_cpu percpu structure and associated code. Additionally to the slab page savings, this reduces percpu allocator usage, and code size. This change builds on recent commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), as caches with enabled debugging also avoid percpu slabs and all allocations and freeing ends up working with the partial list. With a bit more refactoring by the preceding patches, use the same code paths with CONFIG_SLUB_TINY. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
|
#
56d5a2b9 |
|
21-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: split out allocations from pre/post hooks In the following patch we want to introduce CONFIG_SLUB_TINY allocation paths that don't use the percpu slab. To prepare, refactor the allocation functions: Split out __slab_alloc_node() from slab_alloc_node() where the former does the actual allocation and the latter calls the pre/post hooks. Analogically, split out __kmem_cache_alloc_bulk() from kmem_cache_alloc_bulk(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
fa9b88e4 |
|
21-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: refactor free debug processing Since commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), caches with debugging enabled use the free_debug_processing() function to do both freeing checks and actual freeing to partial list under list_lock, bypassing the fast paths. We will want to use the same path for CONFIG_SLUB_TINY, but without the debugging checks, so refactor the code so that free_debug_processing() does only the checks, while the freeing is handled by a new function free_to_partial_list(). For consistency, change return parameter alloc_debug_processing() from int to bool and correct the !SLUB_DEBUG variant to return true and not false. This didn't matter until now, but will in the following changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
90ce872c |
|
21-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering the default slub_max_order we can make slab allocations use smaller pages. However depending on object sizes, order-0 might not be the best due to increased fragmentation. When testing on a 8MB RAM k210 system by Damien Le Moal [1], slub_max_order=1 had the best results, so use that as the default for CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
|
#
5a8a3c1f |
|
15-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY SLUB will leave a number of slabs on the partial list even if they are empty, to avoid some slab freeing and reallocation. The goal of CONFIG_SLUB_TINY is to minimize memory overhead, so set the limits to 0 for immediate slab page freeing. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
|
#
b1a413a3 |
|
14-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: disable SYSFS support with CONFIG_SLUB_TINY Currently SLUB enables its sysfs support depending unconditionally on the general CONFIG_SYSFS setting. To reduce the configuration combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but a randconfig might. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
|
#
346907ce |
|
16-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab: ignore hardened usercopy parameters when disabled With CONFIG_HARDENED_USERCOPY not enabled, there are no __check_heap_object() checks happening that would use the struct kmem_cache useroffset and usersize fields. Yet the fields are still initialized, preventing merging of otherwise compatible caches. Also the fields contribute to struct kmem_cache size unnecessarily when unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is disabled. In kmem_dump_obj() print object_size instead of usersize, as that's actually the intention. In a quick virtme boot test, this has reduced the number of caches in /proc/slabinfo from 131 to 111. Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
|
#
8b881763 |
|
04-Nov-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/migrate: make isolate_movable_page() skip slab pages In the next commit we want to rearrange struct slab fields to allow a larger rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct list_head slab_list", where the value of prev pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can confuse PageMovable() to be a false positive and cause a GPF as reported by lkp [1]. To fix this, make isolate_movable_page() skip pages with the PageSlab flag set. This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page allocation and freeing, and their counterparts to isolate_movable_page(). Based on my RFC from [2]. Added a comment update from Matthew's variant in [3] and, as done there, moved the PageSlab checks to happen before trying to take the page lock. [1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/ [2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/ [3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/ Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
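
Why the poison value can look like a movable mapping is easy to check numerically. The sketch below assumes that the low two bits of the mapping word are tag bits and that the "movable" tag is bit 1 alone; the MODEL_* constants and the helper name are illustrative, and only the 0x122 value and the "bit 1" observation come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON     0x1u	/* assumed tag bit 0 */
#define MODEL_MAPPING_MOVABLE  0x2u	/* assumed tag bit 1 */
#define MODEL_MAPPING_FLAGS    (MODEL_MAPPING_ANON | MODEL_MAPPING_MOVABLE)

/* Low part of LIST_POISON2, i.e. without POISON_POINTER_DELTA added. */
#define MODEL_LIST_POISON2_LOW 0x122u

/* Tag test on the raw word that overlaps page->mapping. */
static bool model_looks_movable(uint64_t mapping_word)
{
	return (mapping_word & MODEL_MAPPING_FLAGS) == MODEL_MAPPING_MOVABLE;
}
```

Since 0x122 has bit 1 set and bit 0 clear, a poisoned list pointer passes the tag test, and adding a delta whose low bits are zero does not change that.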
|
#
a0dc161a |
|
24-Oct-2022 |
Baoquan He <bhe@redhat.com> |
mm/slub, percpu: correct the calculation of early percpu allocation size SLUB allocator relies on percpu allocator to initialize its ->cpu_slab during early boot. For that, the dynamic chunk of percpu which serves the early allocation need be large enough to satisfy the kmalloc creation. However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't consider the kmalloc array with NR_KMALLOC_TYPES length. Fix that with correct calculation. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
8032bf12 |
|
09-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32_below() instead of deprecated function This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
946fa0db |
|
20-Oct-2022 |
Feng Tang <feng.tang@intel.com> |
mm/slub: extend redzone check to extra allocated kmalloc space than requested kmalloc will round up the request size to a fixed size (mostly power of 2), so there could be extra space beyond what is requested, whose size is the actual buffer size minus the original request size. To better detect out-of-bounds access or abuse of this space, add a redzone sanity check for it. In the current kernel, some kmalloc users already know about this space and utilize it after calling 'ksize()' to learn the real size of the allocated buffer. So we skip the sanity check for objects on which ksize() has been called, treating them as legitimate users. Kees Cook is working on sanitizing all these use cases, by using kmalloc_size_roundup() to avoid ambiguous usages. After this is done, this special handling for ksize() can be removed. In some cases, the free pointer could be saved inside the latter part of the object data area, which may overlap the redzone part (for small sizes of kmalloc objects). As suggested by Hyeonggon Yoo, force the free pointer to be in the metadata area when kmalloc redzone debug is enabled, to make all kmalloc objects covered by the redzone check. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
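
The mechanism can be sketched in user-space C: zero only the requested bytes, fill the slack up to the bucket size with a pattern, and later verify the pattern. The model_* names, the 0xcc fill byte and the strict power-of-two rounding are assumptions for illustration (the message says bucket sizes are only "mostly" powers of 2).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_REDZONE_BYTE 0xcc	/* assumed fill pattern, illustrative */

/* Round a request up to the next power of two, like a kmalloc size bucket. */
static size_t model_bucket_size(size_t requested)
{
	size_t bucket = 8;

	while (bucket < requested)
		bucket <<= 1;
	return bucket;
}

/* Zero only the requested bytes; fill the slack with the redzone pattern. */
static void model_init_object(unsigned char *buf, size_t requested,
			      size_t bucket)
{
	memset(buf, 0, requested);
	memset(buf + requested, MODEL_REDZONE_BYTE, bucket - requested);
}

/* True while nothing has written into the slack beyond the request. */
static bool model_redzone_intact(const unsigned char *buf, size_t requested,
				 size_t bucket)
{
	for (size_t i = requested; i < bucket; i++)
		if (buf[i] != MODEL_REDZONE_BYTE)
			return false;
	return true;
}
```

A write to the last requested byte leaves the check passing, while a write one byte further is detected, which is the out-of-bounds case the redzone is meant to catch.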
|
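The redzone idea in the entry above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the power-of-two rounding ignores the non-power-of-two kmalloc buckets, the 8-byte minimum and the `0xCC` fill byte are illustrative choices, and the function names are hypothetical rather than kernel APIs.

```python
REDZONE_BYTE = 0xCC   # illustrative fill value (assumption)
MIN_BUCKET = 8        # illustrative smallest bucket size (assumption)


def bucket_size(requested: int) -> int:
    """Round a request up to the next power-of-two bucket."""
    size = MIN_BUCKET
    while size < requested:
        size *= 2
    return size


def alloc_with_redzone(requested: int) -> bytearray:
    """Zero the requested part and fill the rounded-up tail with the redzone byte."""
    size = bucket_size(requested)
    buf = bytearray(size)  # bytearray() starts zero-filled
    buf[requested:] = bytes([REDZONE_BYTE]) * (size - requested)
    return buf


def first_redzone_violation(buf: bytearray, requested: int):
    """Return the offset of the first overwritten tail byte, or None if intact."""
    for offset in range(requested, len(buf)):
        if buf[offset] != REDZONE_BYTE:
            return offset
    return None
```

In this model a 20-byte request lands in a 32-byte bucket, and a stray write at offset 25 is reported by `first_redzone_violation()` even though it stays inside the bucket.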
#
5d1ba310 |
|
20-Oct-2022 |
Feng Tang <feng.tang@intel.com> |
mm: kasan: Extend kasan_metadata_size() to also cover in-object size When kasan is enabled for slab/slub, it may save kasan's free_meta data in the former part of the slab object data area in the slab object's free path, which works fine. There is an ongoing effort to extend slub's debug function to redzone the latter part of the kmalloc object area, and when both debug features are enabled, there is a possible conflict, especially when the kmalloc object is small, as caught by the 0Day bot [1]. To solve it, slub code needs to know the in-object kasan metadata size. Currently, kasan_metadata_size() returns the size of kasan's metadata inside slub's metadata area, so extend it to also cover the in-object metadata size by adding a boolean flag 'in_object'. There is no functional change to existing code logic. [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
9ce67395 |
|
20-Oct-2022 |
Feng Tang <feng.tang@intel.com> |
mm/slub: only zero requested size of buffer for kzalloc when debug enabled kzalloc/kmalloc will round up the request size to a fixed size (mostly power of 2), so the allocated memory could be more than requested. Currently kzalloc family APIs will zero all the allocated memory. To detect out-of-bound usage of the extra allocated memory, only zero the requested part, so that redzone sanity check could be added to the extra space later. For kzalloc users who will call ksize() later and utilize this extra space, please be aware that the space is not zeroed any more when debug is enabled. (Thanks to Kees Cook's effort to sanitize all ksize() user cases [1], this won't be a big issue). [1]. https://lore.kernel.org/all/20220922031013.2150682-1-keescook@chromium.org/#r Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
946d5f9c |
|
22-Sep-2022 |
Liu Shixin <liushixin2@huawei.com> |
mm/slub.c: use hotplug_memory_notifier() directly Commit 76ae847497bc52 ("Documentation: raise minimum supported version of GCC to 5.1") updated the minimum gcc version to 5.1. So the problem mentioned in f02c69680088 ("include/linux/memory.h: implement register_hotmemory_notifier()") no longer exists, and we can now switch to using hotplug_memory_notifier() directly rather than register_hotmemory_notifier(). Link: https://lkml.kernel.org/r/20220923033347.3935160-4-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
bc29d5bd |
|
26-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: perform free consistency checks before call_rcu For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab freeing. The rcu callback rcu_free_slab() calls __free_slab() that currently includes checking the slab consistency for caches with SLAB_CONSISTENCY_CHECKS flags. This check needs the slab->objects field to be intact. Because in the next patch we want to allow rcu_head in struct slab to become larger in debug configurations and thus potentially overwrite more fields through a union than slab_list, we want to limit the fields used in rcu_free_slab(). Thus move the consistency checks to free_slab() before call_rcu(). This can be done safely even for SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still occur after freeing them. As a result, only the slab->slab_cache field has to be physically separate from rcu_head for the freeing callback to work. We also save some cycles in the rcu callback for caches with consistency checks enabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
a8e53869 |
|
14-Oct-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slub: remove dead code for debug caches on deactivate_slab() After commit c7323a5ad0786 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), SLUB never installs percpu slab for debug caches and thus never deactivates percpu slab for them. Since only debug caches use the full list, SLUB no longer deactivates to full list. Remove dead code in deactivate_slab(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
7c82b3b3 |
|
20-Sep-2022 |
Alexander Atanasov <alexander.atanasov@virtuozzo.com> |
mm: Make failslab writable again In (060807f841ac mm, slub: make remaining slub_debug related attributes read-only) failslab was made read-only. I think it became a collateral victim to the two other options for which the reasons are perfectly valid. Here is why: - sanity_checks and trace are slab internal debug options, failslab is used for fault injection. - for fault injections, which by presumption are random, it does not matter if it is not set atomically. And you need to set at least one more option to trigger fault injection. - in a testing scenario you may need to change it at runtime, for example with module loading: you first test all allocations limited by the space option, then you move on to test only your module's own slabs. - when set by command line flags it effectively disables all cache merges. Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz Signed-off-by: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
1a5ad30b |
|
29-Sep-2022 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
mm: slub: make slab_sysfs_init() a late_initcall Currently, slab_sysfs_init() is an __initcall aka device_initcall. It is rather time-consuming; on my board it takes around 11ms. That's about 1% of the time budget I have from U-Boot letting go and until linux must assume responsibility of keeping the external watchdog happy. There's no particular reason this would need to run at device_initcall time, so instead make it a late_initcall to allow vital functionality to get started a bit sooner. This actually ends up winning more than just those 11ms, because the slab caches that get created during other device_initcalls (and before my watchdog device gets probed) now don't end up doing the somewhat expensive sysfs_slab_add() themselves. Some example lines (with initcall_debug set) before/after: initcall ext4_init_fs+0x0/0x1ac returned 0 after 1386 usecs initcall journal_init+0x0/0x138 returned 0 after 517 usecs initcall init_fat_fs+0x0/0x68 returned 0 after 294 usecs initcall ext4_init_fs+0x0/0x1ac returned 0 after 240 usecs initcall journal_init+0x0/0x138 returned 0 after 32 usecs initcall init_fat_fs+0x0/0x68 returned 0 after 18 usecs Altogether, this means I now get to petting the watchdog around 17ms sooner. [Of course, the time the other initcalls save is instead spent in slab_sysfs_init(), which goes from 11ms to 16ms, so there's no overall change in boot time.] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
979857ea |
|
30-Sep-2022 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
mm: slub: remove dead and buggy code from sysfs_slab_add() The function sysfs_slab_add() has two callers: One is slab_sysfs_init(), which first initializes slab_kset, and only when that succeeds sets slab_state to FULL, and then proceeds to call sysfs_slab_add() for all previously created slabs. The other is __kmem_cache_create(), but only after a if (slab_state <= UP) return 0; check. So in other words, sysfs_slab_add() is never called without slab_kset (aka the return value of cache_kset()) being non-NULL. And this is just as well, because if we ever did take this path and called kobject_init(&s->kobj), and then later when called again from slab_sysfs_init() would end up calling kobject_init_and_add(), we would hit if (kobj->state_initialized) { /* do not error out as sometimes we can recover */ pr_err("kobject (%p): tried to init an initialized object, something is seriously wrong.\n", dump_stack(); } in kobject.c. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
81895a65 |
|
05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use prandom_u32_max() when possible, part 1 Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. 
@plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
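The entry above replaces modulo reductions with a bounded helper that avoids division. One of the patterns the script rewrites, `((u64)(E) * get_random_u32() >> 32)`, shows the underlying multiply-and-shift idea. The following is a user-space sketch of that arithmetic only; `bounded_u32` is a hypothetical name and Python's `random` module stands in for the kernel RNG.

```python
import random

U32_MAX = 0xFFFFFFFF


def bounded_u32(rand32: int, ceil: int) -> int:
    """Map a uniform 32-bit value into [0, ceil) using one multiply and one shift."""
    if not 0 <= rand32 <= U32_MAX:
        raise ValueError("rand32 must fit in 32 bits")
    if not 0 < ceil <= U32_MAX:
        raise ValueError("ceil must be a positive 32-bit value")
    # The 64-bit product is at most (2**32 - 1) * ceil, so the top 32 bits
    # are always strictly below ceil.
    return (rand32 * ceil) >> 32


def random_below(ceil: int) -> int:
    """Convenience wrapper drawing the 32-bit input from Python's RNG."""
    return bounded_u32(random.getrandbits(32), ceil)
```

The extremes make the range easy to check by hand: an input of 0 maps to 0, and an input of 0xFFFFFFFF with a ceiling of 10 maps to 9.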
#
68ef169a |
|
15-Sep-2022 |
Alexander Potapenko <glider@google.com> |
mm: kmsan: call KMSAN hooks from SLUB code In order to report uninitialized memory coming from heap allocations KMSAN has to poison them unless they're created with __GFP_ZERO. It's handy that we need KMSAN hooks in the places where init_on_alloc/init_on_free initialization is performed. In addition, we apply __no_kmsan_checks to get_freepointer_safe() to suppress reports when accessing freelist pointers that reside in freed objects. Link: https://lkml.kernel.org/r/20220915150417.722975-16-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
b731e357 |
|
30-Sep-2022 |
Feng Tang <feng.tang@intel.com> |
mm/slub: fix a slab missed to be freed problem When enabling kasan and kfence's in-kernel kunit tests with slub_debug on, a problem was caught (in the linux-next tree): ------------[ cut here ]------------ kmem_cache_destroy test: Slab cache still has objects when called from test_exit+0x1a/0x30 WARNING: CPU: 3 PID: 240 at mm/slab_common.c:492 kmem_cache_destroy+0x16c/0x170 Modules linked in: CPU: 3 PID: 240 Comm: kunit_try_catch Tainted: G B N 6.0.0-rc7-next-20220929 #52 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:kmem_cache_destroy+0x16c/0x170 Code: 41 5c 41 5d e9 a5 04 0b 00 c3 cc cc cc cc 48 8b 55 60 48 8b 4c 24 20 48 c7 c6 40 37 d2 82 48 c7 c7 e8 a0 33 83 e8 4e d7 14 01 <0f> 0b eb a7 41 56 41 89 d6 41 55 49 89 f5 41 54 49 89 fc 55 48 89 RSP: 0000:ffff88800775fea0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffffffff83bdec48 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 1ffff11000eebf9e RDI: ffffed1000eebfc6 RBP: ffff88804362fa00 R08: ffffffff81182e58 R09: ffff88800775fbdf R10: ffffed1000eebf7b R11: 0000000000000001 R12: 000000008c800d00 R13: ffff888005e78040 R14: 0000000000000000 R15: ffff888005cdfad0 FS: 0000000000000000(0000) GS:ffff88807ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000000360e001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> test_exit+0x1a/0x30 kunit_try_run_case+0xad/0xc0 kunit_generic_run_threadfn_adapter+0x26/0x50 kthread+0x17b/0x1b0 It was bisected to commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"). The problem is inside free_debug_processing(): under certain circumstances the slab can be removed from the partial list but not freed by discard_slab(), and thus n->nr_slabs is not decreased accordingly.
During shutdown, this non-zero n->nr_slabs is detected and reported. Specifically, the problem is that there are two checks for detecting a full partial list by comparing n->nr_partial >= s->min_partial, where the latter check is affected by remove_partial() decreasing n->nr_partial between the checks. Reorganize the code so there is a single check upfront. Link: https://lore.kernel.org/all/20220930100730.250248-1-feng.tang@intel.com/ Fixes: c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe") Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
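The double-check problem described in the entry above can be reproduced in a toy model. This is a simplified sketch, not the kernel function: it only covers the case where the last object of a slab on the partial list is freed, and the class and function names are hypothetical. The comments mark which kernel helper each step stands in for.

```python
class Node:
    """Per-node counters, named after the fields mentioned in the commit message."""

    def __init__(self, nr_partial: int, nr_slabs: int):
        self.nr_partial = nr_partial  # slabs currently on the partial list
        self.nr_slabs = nr_slabs      # all slabs accounted to the node


def free_last_object_two_checks(node: Node, min_partial: int) -> str:
    """Pre-fix shape: the same threshold is evaluated twice."""
    on_partial_list = True
    if node.nr_partial >= min_partial:   # first check
        node.nr_partial -= 1             # stands in for remove_partial()
        on_partial_list = False
    if node.nr_partial >= min_partial:   # second check sees the decremented count
        node.nr_slabs -= 1               # stands in for discard_slab()
        return "discarded"
    return "kept on partial list" if on_partial_list else "lost"


def free_last_object_single_check(node: Node, min_partial: int) -> str:
    """Post-fix shape: decide once, then act on that decision."""
    if node.nr_partial >= min_partial:
        node.nr_partial -= 1
        node.nr_slabs -= 1
        return "discarded"
    return "kept on partial list"
```

With `nr_partial` exactly equal to `min_partial`, the two-check variant removes the slab from the list but never discards it, so `nr_slabs` stays one too high. That is the non-zero count reported at cache shutdown.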
#
d65360f2 |
|
26-Sep-2022 |
Chao Yu <chao.yu@oppo.com> |
mm/slub: clean up create_unique_id() As Christophe JAILLET suggested [1] for create_unique_id(): "looks that ID_STR_LENGTH could even be reduced to 32 or 16. The 2nd BUG_ON at the end of the function could certainly be just removed as well or replaced by a: if (p > name + ID_STR_LENGTH - 1) { kfree(name); return -E<something>; } " Following the above suggestion, do the cleanups below: 1. reduce ID_STR_LENGTH to 32, as the buffer size should be enough; 2. use WARN_ON instead of BUG_ON() and return an error if the check condition is true; 3. use snprintf instead of sprintf to avoid overflow. [1] https://lore.kernel.org/linux-mm/2025305d-16db-abdf-6cd3-1fb93371c2b4@wanadoo.fr/ Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
6edf2576 |
|
13-Sep-2022 |
Feng Tang <feng.tang@intel.com> |
mm/slub: enable debugging memory wasting of kmalloc kmalloc's API family is critical for mm, and by nature it rounds up the request size to a fixed one (mostly a power of 2). Say a user requests '2^n + 1' bytes: 2^(n+1) bytes could actually be allocated, so in the worst case around 50% of the memory is wasted. The wastage is not a big issue for requests that get allocated/freed quickly, but may cause problems with objects that have a longer lifetime. We've met a kernel boot OOM panic (v5.10), and from the dumped slab info: [ 26.062145] kmalloc-2k 814056KB 814056KB From debugging we found there was a huge number of 'struct iova_magazine', whose size is 1032 bytes (1024 + 8), so each allocation wastes 1016 bytes. Though the issue was solved by giving the right (bigger) size of RAM, it is still nice to optimize the size (either use a kmalloc friendly size or create a dedicated slab for it). And from the lkml archive, there was another crash kernel OOM case [1] back in 2019, which seems to be related to a similar slab waste situation, as the log is similar: [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 ... [ 4.857565] kmalloc-2048 59164KB 59164KB The crash kernel only has 256M memory, and 59M is pretty big here. (Note: the related code has been changed and optimised in recent kernel [2], these logs are just picked to demo the problem, also a patch changing its size to 1024 bytes has been merged) So add a way to track each kmalloc's memory waste info, and leverage the existing SLUB debug framework (specifically SLUB_STORE_USER) to show the call stack of the original allocation, so that users can evaluate the waste situation, identify some hot spots and optimize accordingly, for a better utilization of memory.
The waste info is integrated into the existing interface: '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces'. One example of 'kmalloc-4k' after boot is: 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1 __kmem_cache_alloc_node+0x11f/0x4e0 __kmalloc_node+0x4e/0x140 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe] ixgbe_probe+0x165f/0x1d20 [ixgbe] local_pci_probe+0x78/0xc0 work_for_cpu_fn+0x26/0x40 ... which means that in the 'kmalloc-4k' slab there are 126 requests of 2240 bytes which each got a 4KB space (wasting 1856 bytes each and 233856 bytes in total), from ixgbe_alloc_q_vector(). And when the system starts some real workload like multiple docker instances, there could be more severe waste. [1]. https://lkml.org/lkml/2019/8/12/266 [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/ [Thanks Hyeonggon for pointing out several bugs about sorting/format] [Thanks Vlastimil for suggesting way to reduce memory usage of orig_size and keep it only for kmalloc objects] Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: John Garry <john.garry@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
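The waste figures quoted in the entry above can be re-derived with simple arithmetic. The helper below is a hypothetical illustration, not part of the kernel interface; the `waste=total/per-object` layout follows the sample `alloc_traces` line in the commit message.

```python
def kmalloc_waste(requested: int, bucket: int, count: int = 1):
    """Return (per-object waste, total waste) for `count` objects in one bucket."""
    if requested > bucket:
        raise ValueError("request does not fit in the bucket")
    per_object = bucket - requested
    return per_object, per_object * count


def format_waste(requested: int, bucket: int, count: int) -> str:
    """Render the waste in the same shape as the sample trace line."""
    per_object, total = kmalloc_waste(requested, bucket, count)
    return f"waste={total}/{per_object}"
```

For the ixgbe case, `format_waste(2240, 4096, 126)` gives `waste=233856/1856`, and `kmalloc_waste(1032, 2048)` reproduces the 1016 bytes lost per `struct iova_magazine`.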
#
1f04b07d |
|
25-Aug-2022 |
Thomas Gleixner <tglx@linutronix.de> |
slub: Make PREEMPT_RT support less convoluted The slub code already has a few helpers depending on PREEMPT_RT. Add a few more and get rid of the CONFIG_PREEMPT_RT conditionals all over the place. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
5875e598 |
|
23-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock() The PREEMPT_RT specific disabling of irqs in __cmpxchg_double_slab() (through slab_[un]lock()) is unnecessary as bit_spin_lock() disables preemption and that's sufficient on PREEMPT_RT where no allocation/free operation is performed in hardirq context and so can't interrupt the current operation. That means we no longer need the slab_[un]lock() wrappers, so delete them and rename the current __slab_[un]lock() to slab_[un]lock(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
#
4ef3f5a3 |
|
23-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: convert object_map_lock to non-raw spinlock The only remaining user of object_map_lock is list_slab_objects(). Obtaining the lock there used to happen under slab_lock() which implied disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the slab_lock() removed, we can convert it to a normal spinlock. Also remove the get_map()/put_map() wrappers as list_slab_objects() became their only remaining user. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
#
41bec7c3 |
|
23-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: remove slab_lock() usage for debug operations All alloc and free operations on debug caches are now serialized by n->list_lock, so we can remove slab_lock() usage in validate_slab() and list_slab_objects() as those also happen under n->list_lock. Note the usage in list_slab_objects() could happen even on non-debug caches, but only during cache shutdown time, so there should not be any parallel freeing activity anymore. Except for buggy slab users, but in that case the slab_lock() would not help against the common cmpxchg based fast paths (in non-debug caches) anyway. Also adjust documentation comments accordingly. Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
|
#
c7323a5a |
|
23-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: restrict sysfs validation to debug caches and make it safe Rongwei Wang reports [1] that cache validation triggered by writing to /sys/kernel/slab/<cache>/validate is racy against normal cache operations (e.g. freeing) in a way that can cause false positive inconsistency reports for caches with debugging enabled. The problem is that debugging actions that mark object free or active and actual freelist operations are not atomic, and the validation can see an inconsistent state. For caches that do or don't have debugging enabled, additional races involving n->nr_slabs are possible that result in false reports of wrong slab counts. This patch attempts to solve these issues while not adding overhead to normal (especially fastpath) operations for caches that do not have debugging enabled. Such overhead would not be justified to make possible userspace-triggered validation safe. Instead, disable the validation for caches that don't have debugging enabled and make their sysfs validate handler return -EINVAL. For caches that do have debugging enabled, we can instead extend the existing approach of not using percpu freelists to force all alloc/free operations to the slow paths where debugging flags are checked and acted upon. There we can adjust the debug-specific paths to increase n->list_lock coverage against concurrent validation as necessary. The processing on free in free_debug_processing() already happens under n->list_lock so we can extend it to actually do the freeing as well and thus make it atomic against concurrent validation. As observed by Hyeonggon Yoo, we do not really need to take slab_lock() anymore here because all paths we could race with are protected by n->list_lock under the new scheme, so drop its usage here. The processing on alloc in alloc_debug_processing() currently doesn't take any locks, but we have to first allocate the object from a slab on the partial list (as debugging caches have no percpu slabs) and thus take the n->list_lock anyway. 
Add a function alloc_single_from_partial() that grabs just the allocated object instead of the whole freelist, and does the debug processing. The n->list_lock coverage again makes it atomic against validation and it is also ultimately more efficient than the current grabbing of freelist immediately followed by slab deactivation. To prevent races on n->nr_slabs updates, make sure that for caches with debugging enabled, inc_slabs_node() or dec_slabs_node() is called under n->list_lock. When allocating a new slab for a debug cache, handle the allocation by a new function alloc_single_from_new_slab() instead of the current forced deactivation path. Neither of these changes affect the fast paths at all. The changes in slow paths are negligible for non-debug caches. [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/ Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
b84e04f1 |
|
14-Aug-2022 |
Imran Khan <imran.f.khan@oracle.com> |
kfence: add sysfs interface to disable kfence for selected slabs. By default kfence allocation can happen for any slab object whose size is up to PAGE_SIZE, as long as that allocation is the first allocation after expiration of the kfence sample interval. But in certain debugging scenarios we may be interested in debugging corruptions involving some specific slub objects like dentry or ext4_* etc. In such cases, limiting kfence to allocations involving only specific slub objects will increase the probability of catching the issue, since the kfence pool will not be consumed by other slab objects. This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific slabs. Having the interface work in this way does not impact current/default behavior of kfence and allows us to use kfence for specific slabs (when needed) as well. The decision to skip/use kfence is taken depending on whether kmem_cache.flags has the (newly introduced) SLAB_SKIP_KFENCE flag set or not. Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
2c1d697f |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab_common: drop kmem_alloc & avoid dereferencing fields when not using Drop kmem_alloc event class, and define kmalloc and kmem_cache_alloc using TRACE_EVENT() macro. And then this patch does: - Do not pass pointer to struct kmem_cache to trace_kmalloc. gfp flag is enough to know if it's accounted or not. - Avoid dereferencing s->object_size and s->size when not using kmem_cache_alloc event. - Avoid dereferencing s->name when not using kmem_cache_free event. - Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB Cc: Vasily Averin <vasily.averin@linux.dev> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
11e9734b |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab_common: unify NUMA and UMA version of tracepoints Drop kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and remove _node postfix for NUMA version of tracepoints. This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node, but at this point maintaining both kmem_alloc and kmem_alloc_node event classes does not make sense at all. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
26a40990 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/sl[au]b: cleanup kmem_cache_alloc[_node]_trace() Despite its name, kmem_cache_alloc[_node]_trace() is a hook for inlined kmalloc. So rename it to kmalloc[_node]_trace(). Move its implementation to slab_common.c by using __kmem_cache_alloc_node(), but keep CONFIG_TRACING=n variants to save a function call when CONFIG_TRACING=n. Use __assume_kmalloc_alignment for kmalloc[_node]_trace instead of __assume_slab_alignment. Generally kmalloc has larger alignment requirements. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
b1405135 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/sl[au]b: generalize kmalloc subsystem Now everything in kmalloc subsystem can be generalized. Let's do it! Generalize __do_kmalloc_node(), __kmalloc_node_track_caller(), kfree(), __ksize(), __kmalloc(), __kmalloc_node() and move them to slab_common.c. In the meantime, rename kmalloc_large_node_notrace() to __kmalloc_large_node() and make it static as it's now only called in slab_common.c. [ feng.tang@intel.com: adjust kfence skip list to include __kmem_cache_free so that kfence kunit tests do not fail ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
a579b056 |
|
23-Aug-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: move free_debug_processing() further In the following patch, the function free_debug_processing() will be calling add_partial(), remove_partial() and discard_slab(), so move it below their definitions to avoid forward declarations. To make review easier, separate the move from functional changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
|
#
ed4cd17e |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/sl[au]b: introduce common alloc/free functions without tracepoint To unify kmalloc functions in a later patch, introduce common alloc/free functions that do not have tracepoints. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
d6a71648 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab: kmalloc: pass requests larger than order-1 page to page allocator There is not much benefit for serving large objects in kmalloc(). Let's pass large requests to page allocator like SLUB for better maintenance of common code. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
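The dispatch rule this commit describes can be sketched in user-space C. This is a minimal illustration, not the kernel's code: the 4096-byte page size and all `sketch_`/`SKETCH_` names are assumptions made for the example.

```c
#include <stddef.h>

/* Assumed values for illustration only; a real kernel derives them from
 * the architecture's page size. */
#define SKETCH_PAGE_SIZE      4096UL
#define SKETCH_MAX_CACHE_SIZE (2 * SKETCH_PAGE_SIZE)	/* one order-1 page */

enum sketch_backend { SKETCH_SLAB_CACHE, SKETCH_PAGE_ALLOCATOR };

/* Decide which allocator serves a request of the given size. */
enum sketch_backend sketch_kmalloc_backend(size_t size)
{
	/* Anything larger than an order-1 page bypasses the slab caches. */
	if (size > SKETCH_MAX_CACHE_SIZE)
		return SKETCH_PAGE_ALLOCATOR;
	return SKETCH_SLAB_CACHE;
}
```

With these assumed values, a request of exactly 8192 bytes still goes to a slab cache, while 8193 bytes goes to the page allocator.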
|
#
bf37d791 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab_common: kmalloc_node: pass large requests to page allocator Now that kmalloc_large_node() is in common code, pass large requests to page allocator in kmalloc_node() using kmalloc_large_node(). One problem is that currently there is no tracepoint in kmalloc_large_node(). Instead of simply putting tracepoint in it, use kmalloc_large_node{,_notrace} depending on its caller to show useful address for both inlined kmalloc_node() and __kmalloc_node_track_caller() when large objects are allocated. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
a0c3b940 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slub: move kmalloc_large_node() to slab_common.c In later patch SLAB will also pass requests larger than order-1 page to page allocator. Move kmalloc_large_node() to slab_common.c. Fold kmalloc_large_node_hook() into kmalloc_large_node() as there is no other caller. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
0f853b2e |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/sl[au]b: factor out __do_kmalloc_node() __kmalloc(), __kmalloc_node(), __kmalloc_node_track_caller() mostly do same job. Factor out common code into __do_kmalloc_node(). Note that this patch also fixes missing kasan_kmalloc() in SLUB's __kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
c45248db |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab_common: cleanup kmalloc_track_caller() Make kmalloc_track_caller() wrapper of kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
f78a03f6 |
|
17-Aug-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slab_common: remove CONFIG_NUMA ifdefs for common kmalloc functions Now that slab_alloc_node() is available for SLAB when CONFIG_NUMA=n, remove CONFIG_NUMA ifdefs for common kmalloc functions. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
2bfbb027 |
|
21-Aug-2022 |
ye xingchen <ye.xingchen@zte.com.cn> |
mm/slub: Remove the unneeded result variable Return the value from attribute->store(s, buf, len) and attribute->show(s, buf) directly instead of storing it in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
e45cc288 |
|
19-Sep-2022 |
Maurizio Lombardi <mlombard@redhat.com> |
mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context. Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") moved all flush_cpu_slab() invocations to the global workqueue to avoid a problem related with deactivate_slab()/__free_slab() being called from an IRQ context on PREEMPT_RT kernels. When the flush_all_cpus_locked() function is called from a task context, it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up flushing the global workqueue; this will cause a dependency issue.
workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core] is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637 check_flush_dependency+0x10a/0x120
Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
RIP: 0010:check_flush_dependency+0x10a/0x120
Call Trace: __flush_work.isra.0+0xbf/0x220 ? __queue_work+0x1dc/0x420 flush_all_cpus_locked+0xfb/0x120 __kmem_cache_shutdown+0x2b/0x320 kmem_cache_destroy+0x49/0x100 bioset_exit+0x143/0x190 blk_release_queue+0xb9/0x100 kobject_cleanup+0x37/0x130 nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc] nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]
Fix this bug by creating a workqueue for the flush operation with the WQ_MEM_RECLAIM bit set. Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") Cc: <stable@vger.kernel.org> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
5373b8a0 |
|
13-Sep-2022 |
Peter Collingbourne <pcc@google.com> |
kasan: call kasan_malloc() from __kmalloc_*track_caller() We were failing to call kasan_malloc() from __kmalloc_*track_caller() which was causing us to sometimes fail to produce KASAN error reports for allocations made using e.g. devm_kcalloc(), as the KASAN poison was not being initialized. Fix it. Signed-off-by: Peter Collingbourne <pcc@google.com> Cc: <stable@vger.kernel.org> # 5.15 Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
7e9c323c |
|
31-Aug-2022 |
Chao Yu <chao.yu@oppo.com> |
mm/slub: fix to return errno if kmalloc() fails In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to out-of-memory, if it fails, return errno correctly rather than triggering panic via BUG_ON(); kernel BUG at mm/slub.c:5893! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Call trace: sysfs_slab_add+0x258/0x260 mm/slub.c:5973 __kmem_cache_create+0x60/0x118 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline] f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149 mount_bdev+0x1b8/0x210 fs/super.c:1400 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512 legacy_get_tree+0x30/0x74 fs/fs_context.c:610 vfs_get_tree+0x40/0x140 fs/super.c:1530 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040 path_mount+0x358/0x914 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568 Cc: <stable@kernel.org> Fixes: 81819f0fc8285 ("SLUB core") Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
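The change in error handling can be sketched in user-space C as follows. This is a simplified illustration under stated assumptions: the function names, the allocator hook and the identifier format are hypothetical, and an `int` return code stands in for whatever error convention the kernel function actually uses.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*sketch_alloc_fn)(size_t size);

/*
 * Build an identifier string and hand it back through *out. If the
 * allocation fails, report -ENOMEM to the caller instead of aborting.
 */
int sketch_create_id(sketch_alloc_fn alloc, unsigned int size, char **out)
{
	char *name = alloc(32);

	if (!name)
		return -ENOMEM;	/* the old behaviour was an unconditional abort */

	snprintf(name, 32, ":%07u", size);
	*out = name;
	return 0;
}

/* Allocator that always fails, standing in for an out-of-memory condition. */
void *sketch_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

The caller can then propagate the error up the call chain, so a failed cache creation becomes a failed mount rather than a crash.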
|
#
2055e67b |
|
14-Jun-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/sl[au]b: use own bulk free function when bulk alloc failed There is no benefit to call generic bulk free function when kmem_cache_alloc_bulk() failed. Use own kmem_cache_free_bulk() instead of generic function. Note that if kmem_cache_alloc_bulk() fails to allocate first object in SLUB, size is zero. So allow passing size == 0 to kmem_cache_free_bulk() like SLAB's. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
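A minimal user-space sketch of the pattern, assuming `malloc`/`free` as stand-ins for the slab allocator. The `fail_at` parameter is a hypothetical fault-injection knob added only so the failure path can be exercised, and all `sketch_` names are illustrative.

```c
#include <stddef.h>
#include <stdlib.h>

/* Free `size` objects; size == 0 is explicitly allowed and is a no-op. */
void sketch_free_bulk(size_t size, void **objs)
{
	for (size_t i = 0; i < size; i++) {
		free(objs[i]);
		objs[i] = NULL;
	}
}

/*
 * Allocate `nr` objects of `obj_size` bytes. The allocation with index
 * `fail_at` is forced to fail (pass nr or larger for no injected failure).
 * Returns nr on success, or 0 after undoing any partial work.
 */
size_t sketch_alloc_bulk(size_t obj_size, size_t nr, void **objs, size_t fail_at)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		objs[i] = (i == fail_at) ? NULL : malloc(obj_size);
		if (!objs[i])
			goto error;
	}
	return nr;

error:
	/* i objects were allocated so far; i is 0 if the very first one failed. */
	sketch_free_bulk(i, objs);
	return 0;
}
```

Because the error path passes the count of objects allocated so far, the free routine has to tolerate a count of zero, which is the case the commit message calls out.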
|
#
b77d5b1b |
|
29-Apr-2022 |
Muchun Song <songmuchun@bytedance.com> |
mm: slab: optimize memcg_slab_free_hook() Most callers of memcg_slab_free_hook() already know the slab, which could be passed to memcg_slab_free_hook() directly to reduce the overhead of another call to virt_to_slab(). For bulk freeing of objects, the call to slab_objcgs() in the loop in memcg_slab_free_hook() is redundant as well. Rework memcg_slab_free_hook() and build_detached_freelist() to reduce this unnecessary overhead and make memcg_slab_free_hook() able to handle bulk freeing in slab_free(). Move the call site of memcg_slab_free_hook() from do_slab_free() to slab_free() for slub to make the code clearer, since the logic is weird (e.g. the caller needs to judge whether it needs to call memcg_slab_free_hook()). It is easy to make mistakes such as missing a call to memcg_slab_free_hook(), as in the fixes: commit d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") commit ae085d7f9365 ("mm: kfence: fix missing objcg housekeeping for SLAB") This optimization is mainly for bulk object freeing. The following numbers are for 16-object freeing.

                         before    after
  kmem_cache_free_bulk:  ~430 ns   ~400 ns

The overhead is reduced by about 7% for 16-object freeing. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/20220429123044.37885-1-songmuchun@bytedance.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
b347aa7b |
|
02-Jun-2022 |
Vasily Averin <vasily.averin@linux.dev> |
mm/tracing: add 'accounted' entry into output of allocation tracepoints Slab caches marked with SLAB_ACCOUNT force accounting for every allocation from this cache even if __GFP_ACCOUNT flag is not passed. Unfortunately, at the moment this flag is not visible in ftrace output, and this makes it difficult to analyze the accounted allocations. This patch adds a boolean "accounted" entry into trace output, and sets it to 'true' for calls that used the __GFP_ACCOUNT flag and for allocations from caches marked with SLAB_ACCOUNT. It is set to 'false' if accounting is disabled in configs. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/c418ed25-65fe-f623-fbf8-1676528859ed@openvz.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
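The rule for the "accounted" field can be written as a small predicate. The sketch below is a user-space illustration: the bit values are placeholders rather than the kernel's real flag values, and a boolean argument stands in for the build-time configuration check.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; the kernel's real values differ. */
#define SKETCH_GFP_ACCOUNT  0x1u	/* stands in for __GFP_ACCOUNT */
#define SKETCH_SLAB_ACCOUNT 0x1u	/* stands in for SLAB_ACCOUNT */

/*
 * An allocation is reported as accounted when accounting is available and
 * either the caller asked for it or the cache forces it.
 */
bool sketch_accounted(bool accounting_enabled,
		      unsigned int gfp_flags, unsigned int cache_flags)
{
	if (!accounting_enabled)
		return false;
	return (gfp_flags & SKETCH_GFP_ACCOUNT) ||
	       (cache_flags & SKETCH_SLAB_ACCOUNT);
}
```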
|
#
efb93527 |
|
30-May-2022 |
Xiongwei Song <xiongwei.song@windriver.com> |
mm/slub: Simplify __kmem_cache_alias() There is no need to do anything if sysfs_slab_alias() returns a nonzero value after getting a mergeable cache. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/all/e5ebc952-af17-321f-5343-bc914d47c931@suse.cz/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
eeaa345e |
|
08-Jun-2022 |
Jann Horn <jannh@google.com> |
mm/slub: add missing TID updates on slab deactivation The fastpath in slab_alloc_node() assumes that c->slab is stable as long as the TID stays the same. However, two places in __slab_alloc() currently don't update the TID when deactivating the CPU slab. If multiple operations race the right way, this could lead to an object getting lost; or, in an even more unlikely situation, it could even lead to an object being freed onto the wrong slab's freelist, messing up the `inuse` counter and eventually causing a page to be freed to the page allocator while it still contains slab objects. (I haven't actually tested these cases though, this is just based on looking at the code. Writing testcases for this stuff seems like it'd be a pain...)

The race leading to state inconsistency is (all operations on the same CPU and kmem_cache):

 - task A: begin do_slab_free():
    - read TID
    - read pcpu freelist (==NULL)
    - check `slab == c->slab` (true)
 - [PREEMPT A->B]
 - task B: begin slab_alloc_node():
    - fastpath fails (`c->freelist` is NULL)
    - enter __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - take local_lock_irqsave()
    - read c->freelist as NULL
    - get_freelist() returns NULL
    - write `c->slab = NULL`
    - drop local_unlock_irqrestore()
    - goto new_slab
    - slub_percpu_partial() is NULL
    - get_partial() returns NULL
    - slub_put_cpu_ptr() (enables preemption)
 - [PREEMPT B->A]
 - task A: finish do_slab_free():
    - this_cpu_cmpxchg_double() succeeds
    - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

From there, the object on c->freelist will get lost if task B is allowed to continue from here: It will proceed to the retry_load_slab label, set c->slab, then jump to load_freelist, which clobbers c->freelist.

But if we instead continue as follows, we get worse corruption:

 - task A: run __slab_free() on object from other struct slab:
    - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
 - task A: run slab_alloc_node() with NUMA node constraint:
    - fastpath fails (c->slab is NULL)
    - call __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - c->slab is NULL: goto new_slab
    - slub_percpu_partial() is non-NULL
    - set c->slab to slub_percpu_partial(c)
    - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects from slab-2]
    - goto redo
    - node_match() fails
    - goto deactivate_slab
    - existing c->freelist is passed into deactivate_slab()
    - inuse count of slab-1 is decremented to account for object from slab-2

At this point, the inuse count of slab-1 is 1 lower than it should be. This means that if we free all allocated objects in slab-1 except for one, SLUB will think that slab-1 is completely unused, and may free its page, leading to use-after-free. Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab") Fixes: 03e404af26dc2 ("slub: fast release on full slab") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
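The first interleaving in this commit message can be modelled in a few lines of single-threaded user-space C. This is a deliberately simplified sketch: the structure layout, the `sketch_` names and the plain (non-atomic) compare are assumptions for illustration, and the TID is incremented by 1 where the kernel uses its own stepping scheme.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the per-CPU state; not the kernel's layout. */
struct sketch_cpu_state {
	void *slab;		/* slab the freelist belongs to */
	void *freelist;		/* next free object, or NULL */
	unsigned long tid;	/* should change whenever the state changes */
};

/*
 * Model of the fast path's double compare-and-exchange: the update is
 * applied only if both the freelist and the TID still hold the values the
 * caller sampled earlier. Note that c->slab itself is not compared.
 */
bool sketch_commit(struct sketch_cpu_state *c, void *old_freelist,
		   unsigned long old_tid, void *new_freelist)
{
	if (c->freelist != old_freelist || c->tid != old_tid)
		return false;
	c->freelist = new_freelist;
	c->tid = old_tid + 1;
	return true;
}

/* Slow-path deactivation; bump_tid selects buggy (false) or fixed (true). */
void sketch_deactivate(struct sketch_cpu_state *c, bool bump_tid)
{
	c->slab = NULL;
	if (bump_tid)
		c->tid++;
}
```

If the deactivation leaves the TID alone, a commit based on values sampled before the deactivation still succeeds and leaves `slab == NULL` with a non-NULL freelist. With the TID bumped, the stale commit fails and the caller has to retry with fresh values.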
|
#
c4cf6785 |
|
07-Jun-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
mm/slub: Move the stackdepot related allocation out of IRQ-off section. The set_track() invocation in free_debug_processing() is invoked with slab_lock() held. The lock disables interrupts on PREEMPT_RT, and this forbids allocating memory, which stack_depot_save() does. Split set_track() into two parts: set_track_prepare(), which allocates memory, and set_track_update(), which only performs the assignment of the trace data structure. Use set_track_prepare() before disabling interrupts. [ vbabka@suse.cz: make set_track() call set_track_update() instead of open-coded assignments ] Fixes: 5cf909c553e9e ("mm/slub: use stackdepot to save stack trace in objects") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/Yp9sqoUi4fVa5ExF@linutronix.de
|
#
23587f7c |
|
29-Apr-2022 |
Miaohe Lin <linmiaohe@huawei.com> |
mm/slub: remove unused kmem_cache_order_objects max max field holds the largest slab order that was ever used for a slab cache. But it's unused now. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220429090545.33413-1-linmiaohe@huawei.com
|
#
a204e6d6 |
|
19-Apr-2022 |
Miaohe Lin <linmiaohe@huawei.com> |
mm/slub: remove unneeded return value of slab_pad_check The return value of slab_pad_check is never used. So we can make it return void now. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220419120352.37825-1-linmiaohe@huawei.com
|
#
6b6efe23 |
|
09-Apr-2022 |
JaeSang Yoo <js.yoo.5b@gmail.com> |
mm/slub: remove meaningless node check in ___slab_alloc() node_match() with node=NUMA_NO_NODE always returns 1. Duplicate check by goto statement is meaningless. Remove it. Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220409144239.2649257-1-jsyoo5b@gmail.com
|
#
27c08f75 |
|
09-Apr-2022 |
Jiyoup Kim <lakroforce@gmail.com> |
mm/slub: remove duplicate flag in allocate_slab() In allocate_slab(), __GFP_NOFAIL flag is removed twice when trying higher-order allocation. Remove it. Signed-off-by: Jiyoup Kim <lakroforce@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220409150538.1264-1-lakroforce@gmail.com
|
#
c0f81a94 |
|
11-Apr-2022 |
JaeSang Yoo <js.yoo.5b@gmail.com> |
mm/slub: remove unused parameter in setup_object*() setup_object_debug() and setup_object() have an unused parameter, "struct slab *slab". Remove it. Commit 3ec0974210fe ("SLUB: Simplify debug code") introduced setup_object_debug() to refactor code blocks that were previously in setup_object(). The previous code checked SlabDebug() before calling init_object() and init_tracking(). Whereas SlabDebug() takes "struct page *page" as an argument, setup_object_debug() checks the flags of "struct kmem_cache *s", which doesn't require "struct page *page". The struct page was later changed into struct slab by commit bb192ed9aa719 ("mm/slub: Convert most struct page to struct slab by spatch"), but it is still an unused parameter. Suggested-by: Ohhoon Kwon <ohkwon1043@gmail.com> Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220411072534.3372768-1-jsyoo5b@gmail.com
|
#
553c0369 |
|
21-May-2021 |
Oliver Glitta <glittao@gmail.com> |
mm/slub: sort debugfs output by frequency of stack traces Sort the output of debugfs alloc_traces and free_traces by the frequency of allocation/freeing stack traces. Most frequently used stack traces will be printed first, e.g. for easier memory leak debugging. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
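Sorting trace records by how often they occur can be sketched with the C library's `qsort`. This is a user-space illustration only: the record layout and the `sketch_` names are hypothetical, and `qsort` stands in for whatever sorting helper the kernel code uses.

```c
#include <stdlib.h>

/* Simplified record: one entry per unique stack trace. */
struct sketch_location {
	unsigned long count;	/* how many objects share this trace */
	unsigned long addr;	/* caller address, kept for display */
};

/* qsort comparator: larger counts sort first. */
int sketch_cmp_by_count_desc(const void *a, const void *b)
{
	const struct sketch_location *la = a;
	const struct sketch_location *lb = b;

	if (la->count > lb->count)
		return -1;
	if (la->count < lb->count)
		return 1;
	return 0;
}

/* Order the records so the most frequent trace is printed first. */
void sketch_sort_locations(struct sketch_location *loc, size_t n)
{
	qsort(loc, n, sizeof(*loc), sketch_cmp_by_count_desc);
}
```

The comparator uses explicit comparisons instead of subtracting the counts, which avoids truncation when an unsigned long difference is converted to int.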
|
#
8ea9fb92 |
|
21-May-2021 |
Oliver Glitta <glittao@gmail.com> |
mm/slub: distinguish and print stack traces in debugfs files Aggregate objects in slub cache by unique stack trace in addition to caller address when producing contents of debugfs files alloc_traces and free_traces in debugfs. Also add the stack traces to the debugfs output. This makes it much more useful to e.g. debug memory leaks. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
5cf909c5 |
|
07-Jul-2021 |
Oliver Glitta <glittao@gmail.com> |
mm/slub: use stackdepot to save stack trace in objects Many stack traces are similar so there are many similar arrays. Stackdepot saves each unique stack only once. Replace field addrs in struct track with depot_stack_handle_t handle. Use stackdepot to save stack trace. The benefits are smaller memory overhead and possibility to aggregate per-cache statistics in the following patch using the stackdepot handle instead of matching stacks manually. [ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ] This was initially merged as commit 788691464c29 and reverted by commit ae14c63a9f20 due to several issues, that should now be fixed. The problem of unconditional memory overhead by stackdepot has been addressed by commit 2dba5eb1c73b ("lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()"), so the dependency on stackdepot will result in extra memory usage only when a slab cache tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds. The build failures on some architectures were also addressed, and the reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this patch. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
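The deduplication idea can be shown with a toy depot in user-space C. This is a sketch under loud assumptions: it uses a fixed-size table with a linear search, where a real depot would use hashing and dynamic storage, and every `sketch_` name and limit is invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_STACKS 64
#define SKETCH_MAX_DEPTH  8

typedef unsigned int sketch_handle_t;	/* 0 means "no stack saved" */

struct sketch_stack {
	unsigned int nr;
	unsigned long entries[SKETCH_MAX_DEPTH];
};

static struct sketch_stack sketch_depot[SKETCH_MAX_STACKS];
static unsigned int sketch_depot_used;

/*
 * Save a stack trace and return a handle to it. A trace identical to one
 * saved earlier yields the same handle, so it is stored only once.
 * Returns 0 if the trace is too deep or the depot is full.
 */
sketch_handle_t sketch_depot_save(const unsigned long *entries, unsigned int nr)
{
	if (nr > SKETCH_MAX_DEPTH)
		return 0;

	for (unsigned int i = 0; i < sketch_depot_used; i++) {
		if (sketch_depot[i].nr == nr &&
		    memcmp(sketch_depot[i].entries, entries,
			   nr * sizeof(*entries)) == 0)
			return i + 1;
	}

	if (sketch_depot_used == SKETCH_MAX_STACKS)
		return 0;

	sketch_depot[sketch_depot_used].nr = nr;
	memcpy(sketch_depot[sketch_depot_used].entries, entries,
	       nr * sizeof(*entries));
	return ++sketch_depot_used;
}

/* What a per-object tracking record shrinks to: a handle, not an array. */
struct sketch_track {
	unsigned long addr;
	sketch_handle_t handle;
};
```

Two objects allocated from the same call path end up with equal handles, which is also what makes per-cache aggregation by handle cheap.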
|
#
0cd1a029 |
|
04-Feb-2022 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: move struct track init out of set_track() set_track() either zeroes out the struct track or fills it, depending on the addr parameter. This is unnecessary as there's only one place that calls it for the initialization - init_tracking(). We can simply do the zeroing there, with a single memset() that covers both TRACK_ALLOC and TRACK_FREE as they are adjacent. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
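Zeroing two adjacent records with one call can be sketched as below. The field list and the `sketch_` names are illustrative assumptions; only the idea of two neighbouring track records, one for allocation and one for free, comes from the commit message.

```c
#include <string.h>

/* Illustrative tracking record; the field list is an assumption. */
struct sketch_trk {
	unsigned long addr;
	int cpu;
	int pid;
	unsigned long when;
};

enum { SKETCH_TRACK_ALLOC, SKETCH_TRACK_FREE, SKETCH_NR_TRACKS };

/*
 * The alloc and free records sit next to each other in one array, so a
 * single memset() spanning both clears them together.
 */
void sketch_init_tracking(struct sketch_trk tracks[SKETCH_NR_TRACKS])
{
	memset(tracks, 0, SKETCH_NR_TRACKS * sizeof(tracks[0]));
}
```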
|
#
a285909f |
|
06-Apr-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slub, kunit: Make slub_kunit unaffected by user specified flags slub_kunit does not expect other debugging flags to be set when running tests. When SLAB_RED_ZONE flag is set globally, test fails because the flag affects number of errors reported. To make slub_kunit unaffected by user specified debugging flags, introduce SLAB_NO_USER_FLAGS to ignore them. With this flag, only flags specified in the code are used and others are ignored. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/Yk0sY9yoJhFEXWOg@hyeyoo
|
#
2dfe63e6 |
|
14-Apr-2022 |
Marco Elver <elver@google.com> |
mm, kfence: support kmem_dump_obj() for KFENCE objects Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been producing garbage data due to the object not actually being maintained by SLAB or SLUB. Fix this by implementing __kfence_obj_info() that copies relevant information to struct kmem_obj_info when the object was allocated by KFENCE; this is called by a common kmem_obj_info(), which also calls the slab/slub/slob specific variant now called __kmem_obj_info(). For completeness, kmem_dump_obj() now displays if the object was allocated by KFENCE. Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB") Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a485e1da |
|
10-Mar-2022 |
Xiongwei Song <sxwjean@gmail.com> |
mm: slub: Delete useless parameter of alloc_slab_page() The parameter @s is useless for alloc_slab_page(). It was added in 2014 by commit 5dfb41750992 ("sl[au]b: charge slabs to kmemcg explicitly"). The need for it was removed in 2020 by commit 1f3147b49d75 ("mm: slub: call account_slab_page() after slab page initialization"). Let's delete it. [willy@infradead.org: Added detailed history of @s] Signed-off-by: Xiongwei Song <sxwjean@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220310140701.87908-3-sxwjean@me.com
|
#
ae44d81d |
|
09-Mar-2022 |
Miaohe Lin <linmiaohe@huawei.com> |
mm/slub: remove forced_order parameter in calculate_sizes Since commit 32a6f409b693 ("mm, slub: remove runtime allocation order changes"), forced_order is always -1. Remove this unneeded parameter to simplify the code. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220309092036.50844-1-linmiaohe@huawei.com
|
#
6d3a16d0 |
|
07-Mar-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slub: refactor deactivate_slab() Simplify deactivate_slab() by unlocking n->list_lock and retrying cmpxchg_double() when cmpxchg_double() fails, and perform add_{partial,full} only when it succeeds. Releasing and taking n->list_lock again here is not harmful as SLUB avoids deactivating slabs as much as possible. [ vbabka@suse.cz: perform add_{partial,full} when cmpxchg_double() succeeds. Count deactivating full slabs even if debugging flag is not set. ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220307074057.902222-3-42.hyeyoo@gmail.com
|
#
5182f3c9 |
|
07-Mar-2022 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm/slub: limit number of node partial slabs only in cache creation SLUB sets the minimum number of partial slabs per node (min_partial) using set_min_partial(). SLUB holds at least min_partial slabs even if they're empty to avoid excessive use of the page allocator. set_min_partial() limits the value of min_partial to between MIN_PARTIAL and MAX_PARTIAL. As set_min_partial() can be called by min_partial_store() too, only limit the value of min_partial in kmem_cache_open() so that it can be changed to a value that a user wants. [ rientjes@google.com: Fold set_min_partial() into its callers ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220307074057.902222-2-42.hyeyoo@gmail.com
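The behaviour after this change, clamping at cache creation but not on later user writes, can be sketched in user-space C. The bounds 5 and 10 and the size-based default are assumptions made for the example, and all `sketch_` names are hypothetical.

```c
/* Assumed bounds for illustration; consult the kernel source for real values. */
#define SKETCH_MIN_PARTIAL 5UL
#define SKETCH_MAX_PARTIAL 10UL

struct sketch_cache {
	unsigned int size;		/* object size in bytes */
	unsigned long min_partial;	/* partial slabs to keep per node */
};

/* Integer log2 for v > 0. */
unsigned int sketch_ilog2(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Cache creation: derive a default from the object size, then clamp it. */
void sketch_cache_open(struct sketch_cache *s, unsigned int size)
{
	unsigned long v = sketch_ilog2(size) / 2;	/* assumed heuristic */

	if (v > SKETCH_MAX_PARTIAL)
		v = SKETCH_MAX_PARTIAL;
	if (v < SKETCH_MIN_PARTIAL)
		v = SKETCH_MIN_PARTIAL;

	s->size = size;
	s->min_partial = v;
}

/* Later user write (sysfs-style store): taken as given, with no clamping. */
void sketch_min_partial_store(struct sketch_cache *s, unsigned long value)
{
	s->min_partial = value;
}
```

Under these assumptions a 64-byte cache starts at the lower bound of 5, and a later store of 2 or 100 is accepted unchanged.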
|
#
d1d28bd9 |
|
06-Mar-2022 |
Lianjie Zhang <zhanglianjie@uniontech.com> |
mm/slub: use helper macro __ATTR_XX_MODE for SLAB_ATTR(_RO) This allows more concise code, and VERIFY_OCTAL_PERMISSIONS() can help validate any future change. Signed-off-by: Lianjie Zhang <zhanglianjie@uniontech.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220306073818.15089-1-zhanglianjie@uniontech.com
|
#
88f2ef73 |
|
22-Mar-2022 |
Muchun Song <songmuchun@bytedance.com> |
mm: introduce kmem_cache_alloc_lru We currently allocate scope for every memcg to be able to be tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that are instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is adding the memcg to the list_lru at the first insert. So introduce kmem_cache_alloc_lru to allocate objects and its list_lru. In a later patch, we will convert all inode and dentry allocation from kmem_cache_alloc to kmem_cache_alloc_lru. Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9c01e9af |
|
10-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled The fields 'next' and 'slabs' are only used when CONFIG_SLUB_CPU_PARTIAL is enabled. We can put their definition to #ifdef to prevent accidental use when disabled. Currently show_slab_objects() and slabs_cpu_partial_show() contain code accessing the slabs field that's effectively dead with CONFIG_SLUB_CPU_PARTIAL=n through the wrappers slub_percpu_partial() and slub_percpu_partial_read_once(), but to prevent a compile error, we need to hide all this code behind #ifdef. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
6e48a966 |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm/kasan: Convert to struct folio and struct slab KASAN accesses some slab related struct page fields so we need to convert it to struct slab. Some places are a bit simplified thanks to kasan_addr_to_slab() encapsulating the PageSlab flag check through virt_to_slab(). When resolving object address to either a real slab or a large kmalloc, use struct folio as the intermediate type for testing the slab flag to avoid unnecessary implicit compound_head(). [ vbabka@suse.cz: use struct folio, adjust to differences in previous patches ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <kasan-dev@googlegroups.com>
|
#
40f3bf0c |
|
02-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm: Convert struct page to struct slab in functions used by other subsystems KASAN, KFENCE and memcg interact with SLAB or SLUB internals through functions nearest_obj(), obj_to_index() and objs_per_slab() that use struct page as parameter. This patch converts it to struct slab including all callers, through a coccinelle semantic patch.

// Options: --include-headers --no-includes --smpl-spacing include/linux/slab_def.h include/linux/slub_def.h mm/slab.h mm/kasan/*.c mm/kfence/kfence_test.c mm/memcontrol.c mm/slab.c mm/slub.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace

@@
@@

-objs_per_slab_page(
+objs_per_slab(
 ...
 )
 { ... }

@@
@@

-objs_per_slab_page(
+objs_per_slab(
 ...
 )

@@
identifier fn =~ "obj_to_index|objs_per_slab";
@@

 fn(...,
-   const struct page *page
+   const struct slab *slab
    ,...)
 {
<...
(
- page_address(page)
+ slab_address(slab)
|
- page
+ slab
)
...>
 }

@@
identifier fn =~ "nearest_obj";
@@

 fn(...,
-   struct page *page
+   const struct slab *slab
    ,...)
 {
<...
(
- page_address(page)
+ slab_address(slab)
|
- page
+ slab
)
...>
 }

@@
identifier fn =~ "nearest_obj|obj_to_index|objs_per_slab";
expression E;
@@

 fn(...,
(
- slab_page(E)
+ E
|
- virt_to_page(E)
+ virt_to_slab(E)
|
- virt_to_head_page(E)
+ virt_to_slab(E)
|
- page
+ page_slab(page)
)
  ,...)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <kasan-dev@googlegroups.com> Cc: <cgroups@vger.kernel.org>
|
#
c2092c12 |
|
15-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Finish struct page to struct slab conversion Update comments mentioning pages to mention slabs where appropriate. Also some goto labels. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
bb192ed9 |
|
03-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Convert most struct page to struct slab by spatch The majority of conversion from struct page to struct slab in SLUB internals can be delegated to a coccinelle semantic patch. This includes renaming of variables with 'page' in name to 'slab', and similar. Big thanks to Julia Lawall and Luis Chamberlain for help with coccinelle. // Options: --include-headers --no-includes --smpl-spacing include/linux/slub_def.h mm/slub.c // Note: needs coccinelle 1.1.1 to avoid breaking whitespace, and ocaml for the // embedded script // build list of functions to exclude from applying the next rule @initialize:ocaml@ @@ let ok_function p = not (List.mem (List.hd p).current_element ["nearest_obj";"obj_to_index";"objs_per_slab_page";"__slab_lock";"__slab_unlock";"free_nonslab_page";"kmalloc_large_node"]) // convert the type from struct page to struct slab in all functions except the // list from previous rule // this also affects struct kmem_cache_cpu, but that's ok @@ position p : script:ocaml() { ok_function p }; @@ - struct page@p + struct slab // in struct kmem_cache_cpu, change the name from page to slab // the type was already converted by the previous rule @@ @@ struct kmem_cache_cpu { ... -struct slab *page; +struct slab *slab; ... } // there are many places that use c->page which is now c->slab after the // previous rule @@ struct kmem_cache_cpu *c; @@ -c->page +c->slab @@ @@ struct kmem_cache { ... - unsigned int cpu_partial_pages; + unsigned int cpu_partial_slabs; ... } @@ struct kmem_cache *s; @@ - s->cpu_partial_pages + s->cpu_partial_slabs @@ @@ static void - setup_page_debug( + setup_slab_debug( ...) {...} @@ @@ - setup_page_debug( + setup_slab_debug( ...); // for all functions (with exceptions), change any "struct slab *page" // parameter to "struct slab *slab" in the signature, and generally all // occurrences of "page" to "slab" in the body - with some special cases.
@@ identifier fn !~ "free_nonslab_page|obj_to_index|objs_per_slab_page|nearest_obj"; @@ fn(..., - struct slab *page + struct slab *slab ,...) { <... - page + slab ...> } // similar to previous but the param is called partial_page @@ identifier fn; @@ fn(..., - struct slab *partial_page + struct slab *partial_slab ,...) { <... - partial_page + partial_slab ...> } // similar to previous but for functions that take pointer to struct page ptr @@ identifier fn; @@ fn(..., - struct slab **ret_page + struct slab **ret_slab ,...) { <... - ret_page + ret_slab ...> } // functions converted by previous rules that were temporarily called using // slab_page(E) so we want to remove the wrapper now that they accept struct // slab ptr directly @@ identifier fn =~ "slab_free|do_slab_free"; expression E; @@ fn(..., - slab_page(E) + E ,...) // similar to previous but for another pattern @@ identifier fn =~ "slab_pad_check|check_object"; @@ fn(..., - folio_page(folio, 0) + slab ,...) // functions that were returning struct page ptr and now will return struct // slab ptr, including slab_page() wrapper removal @@ identifier fn =~ "allocate_slab|new_slab"; expression E; @@ static -struct slab * +struct slab * fn(...) { <... - slab_page(E) + E ...> } // rename any former struct page * declarations @@ @@ struct slab * ( - page + slab | - partial_page + partial_slab | - oldpage + oldslab ) ; // this has to be separate from previous rule as page and page2 appear at the // same line @@ @@ struct slab * -page2 +slab2 ; // similar but with initial assignment @@ expression E; @@ struct slab * ( - page + slab | - flush_page + flush_slab | - discard_page + slab_to_discard | - page_to_unfreeze + slab_to_unfreeze ) = E; // convert most of struct page to struct slab usage inside functions (with // exceptions), including specific variable renames @@ identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node"; expression E; @@ fn(...) 
{ <... ( - int pages; + int slabs; | - int pages = E; + int slabs = E; | - page + slab | - flush_page + flush_slab | - partial_page + partial_slab | - oldpage->pages + oldslab->slabs | - oldpage + oldslab | - unsigned int nr_pages; + unsigned int nr_slabs; | - nr_pages + nr_slabs | - unsigned int partial_pages = E; + unsigned int partial_slabs = E; | - partial_pages + partial_slabs ) ...> } // this has to be split out from the previous rule so that lines containing // multiple matching changes will be fully converted @@ identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node"; @@ fn(...) { <... ( - slab->pages + slab->slabs | - pages + slabs | - page2 + slab2 | - discard_page + slab_to_discard | - page_to_unfreeze + slab_to_unfreeze ) ...> } // after we simply changed all occurrences of page to slab, some usages need // adjustment for slab-specific functions, or use slab_page() wrapper @@ identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node"; @@ fn(...) { <... ( - page_slab(slab) + slab | - kasan_poison_slab(slab) + kasan_poison_slab(slab_page(slab)) | - page_address(slab) + slab_address(slab) | - page_size(slab) + slab_size(slab) | - PageSlab(slab) + folio_test_slab(slab_folio(slab)) | - page_to_nid(slab) + slab_nid(slab) | - compound_order(slab) + slab_order(slab) ) ...> } Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Luis Chamberlain <mcgrof@kernel.org>
|
#
01b34d16 |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm/slub: Convert pfmemalloc_match() to take a struct slab Preparatory for mass conversion. Use the new slab_test_pfmemalloc() helper. As it doesn't do VM_BUG_ON(!PageSlab()) we no longer need the pfmemalloc_match_unsafe() variant. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
4020b4a2 |
|
28-Oct-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Convert __free_slab() to use struct slab __free_slab() is on the boundary of distinguishing struct slab and struct page so start with struct slab but convert to folio for working with flags and folio_page() to call functions that require struct page. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
45387b8c |
|
26-Oct-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Convert alloc_slab_page() to return a struct slab Preparatory, callers convert back to struct page for now. Also move setting page flags to alloc_slab_page() where we still operate on a struct page. This means the page->slab_cache pointer is now set later than the PageSlab flag, which could theoretically confuse some pfn walker assuming PageSlab means there would be a valid cache pointer. But as the code had no barriers and used __set_bit() anyway, it could have happened already, so there shouldn't be such a walker. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
fb012e27 |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm/slub: Convert print_page_info() to print_slab_info() Improve the type safety and prepare for further conversion. For flags access, convert to folio internally. [ vbabka@suse.cz: access flags via folio_flags() ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
|
#
0393895b |
|
26-Oct-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Convert __slab_lock() and __slab_unlock() to struct slab These functions operate on the PG_locked page flag, but make them accept struct slab to encapsulate this implementation detail. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
d835eef4 |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm/slub: Convert kfree() to use a struct slab Convert kfree(), kmem_cache_free() and ___cache_free() to resolve object addresses to struct slab, using folio as intermediate step where needed. Keep passing the result as struct page for now in preparation for mass conversion of internal functions. [ vbabka@suse.cz: Use folio as intermediate step when checking for large kmalloc pages, and when freeing them - rename free_nonslab_page() to free_large_kmalloc() that takes struct folio ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
cc465c3b |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm/slub: Convert detached_freelist to use a struct slab This gives us a little bit of extra typesafety as we know that nobody called virt_to_page() instead of virt_to_head_page(). [ vbabka@suse.cz: Use folio as intermediate step when filtering out large kmalloc pages ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
0b3eb091 |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Convert check_heap_object() to use struct slab Ensure that we're not seeing a tail page inside __check_heap_object() by converting to a slab instead of a page. Take the opportunity to mark the slab as const since we're not modifying it. Also move the declaration of __check_heap_object() to mm/slab.h so it's not available to the wider kernel. [ vbabka@suse.cz: in check_heap_object() only convert to struct slab for actual PageSlab pages; use folio as intermediate step instead of page ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
7213230a |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Use struct slab in kmem_obj_info() All three implementations of slab support kmem_obj_info() which reports details of an object allocated from the slab allocator. By using the slab type instead of the page type, we make it obvious that this can only be called for slabs. [ vbabka@suse.cz: also convert the related kmem_valid_obj() to folios ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
0c24811b |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Convert __ksize() to struct slab In SLUB, use folios, and struct slab to access slab_cache field. In SLOB, use folios to properly resolve pointers beyond PAGE_SIZE offset of the object. [ vbabka@suse.cz: use folios, and only convert folio_test_slab() == true folios to struct slab ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
b918653b |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Convert [un]account_slab_page() to struct slab Convert the parameter of these functions to struct slab instead of struct page and drop _page from the names. For now their callers just convert page to slab. [ vbabka@suse.cz: replace existing functions instead of calling them ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
d122019b |
|
04-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Split slab into its own type Make struct slab independent of struct page. It still uses the underlying memory in struct page for storing slab-specific data, but slab and slub can now be weaned off using struct page directly. Some of the wrapper functions (slab_address() and slab_order()) still need to cast to struct folio, but this is a significant disentanglement. [ vbabka@suse.cz: Rebase on folios, use folio instead of page where possible. Do not duplicate flags field in struct slab, instead make the related accessors go through slab_folio(). For testing pfmemalloc use the folio_*_active flag accessors directly so the PageSlabPfmemalloc wrappers can be removed later. Make folio_slab() expect only folio_test_slab() == true folios and virt_to_slab() return NULL when folio_test_slab() == false. Move struct slab to mm/slab.h. Don't represent with struct slab pages that are not true slab pages, but just a compound page obtained directly from page allocator (with large kmalloc() for SLUB and SLOB). ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
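
The type split described in this message can be illustrated with a small user-space sketch. Everything here is a toy stand-in (the `toy_` names, the flag value, and the struct layouts are assumptions made for illustration, not the kernel's definitions); it only shows the shape of the idea: a distinct slab type overlaying the same storage, flags reached through the folio view, and a lookup that returns NULL for memory that is not a true slab.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's definitions. */
#define TOY_FLAG_SLAB 0x1UL

struct toy_folio {
	unsigned long flags;
	void *payload;
};

/* Same size and layout as toy_folio, but a distinct type with slab names. */
struct toy_slab {
	unsigned long __flags;	/* not read directly; go through the folio view */
	void *slab_cache;
};

static bool toy_folio_test_slab(const struct toy_folio *folio)
{
	return (folio->flags & TOY_FLAG_SLAB) != 0;
}

/* Caller is expected to have checked toy_folio_test_slab() first. */
static struct toy_slab *toy_folio_slab(struct toy_folio *folio)
{
	return (struct toy_slab *)folio;
}

/* Flag accessors go back through the folio view of the same memory. */
static struct toy_folio *toy_slab_folio(struct toy_slab *slab)
{
	return (struct toy_folio *)slab;
}

/* Returns NULL for memory that is not a true slab (e.g. a large allocation). */
static struct toy_slab *toy_to_slab(struct toy_folio *folio)
{
	if (!toy_folio_test_slab(folio))
		return NULL;
	return toy_folio_slab(folio);
}
```

Portable user-space code that wanted to read fields through both views would need a union or a build without strict aliasing; the sketch only passes pointers between the two types and reads flags through the folio view.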
|
#
ae16d059 |
|
26-Oct-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: Make object_err() static There are no callers outside of mm/slub.c anymore. Move freelist_corrupted() that calls object_err() to avoid a need for forward declaration. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
|
#
005a79e5 |
|
10-Dec-2021 |
Gerald Schaefer <gerald.schaefer@linux.ibm.com> |
mm/slub: fix endianness bug for alloc/free_traces attributes On big-endian s390, the alloc/free_traces attributes produce endless output, because of always 0 idx in slab_debugfs_show(). idx is de-referenced from *v, which points to a loff_t value, with unsigned int idx = *(unsigned int *)v; This will only give the upper 32 bits on big-endian, which remain 0. Instead of only fixing this de-reference, during discussion it seemed more appropriate to change the seq_ops so that they use an explicit iterator in private loc_track struct. This patch adds idx to loc_track, which will also fix the endianness bug. Link: https://lore.kernel.org/r/20211117193932.4049412-1-gerald.schaefer@linux.ibm.com Link: https://lkml.kernel.org/r/20211126171848.17534-1-gerald.schaefer@linux.ibm.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reported-by: Steffen Maier <maier@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
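
A minimal user-space sketch of why the pointer cast quoted in this message reads 0 on big-endian hosts. The type `toy_loff_t` and the function names are hypothetical, and 32-bit `unsigned int` is assumed; the actual fix in the commit moved the iterator into `loc_track` rather than changing the cast.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a 64-bit file position type. */
typedef int64_t toy_loff_t;

/*
 * The pattern from the commit message: read the first 32 bits stored at the
 * address of a 64-bit position. memcpy reads the same bytes that
 * *(unsigned int *)v would, so the result depends on host byte order.
 */
static uint32_t idx_via_pointer_cast(const toy_loff_t *v)
{
	uint32_t idx;

	memcpy(&idx, v, sizeof(idx));
	return idx;
}

/* On big-endian the first four bytes are the upper half of the value. */
static uint32_t idx_seen_on_big_endian(toy_loff_t pos)
{
	return (uint32_t)((uint64_t)pos >> 32);
}

/* On little-endian the first four bytes are the lower half. */
static uint32_t idx_seen_on_little_endian(toy_loff_t pos)
{
	return (uint32_t)((uint64_t)pos & 0xffffffffu);
}
```

For any position below 2^32 the big-endian view stays 0, which matches the "always 0 idx" symptom described above.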
|
#
9a543f00 |
|
19-Nov-2021 |
Yunfeng Ye <yeyunfeng@huawei.com> |
mm: emit the "free" trace report before freeing memory in kmem_cache_free() After the memory is freed, it can be immediately allocated by other CPUs, before the "free" trace report has been emitted. This causes inaccurate traces. For example, if the following sequence of events occurs: CPU 0 CPU 1 (1) alloc xxxxxx (2) free xxxxxx (3) alloc xxxxxx (4) free xxxxxx Then they will be inaccurately reported via tracing, so that they appear to have happened in this order: CPU 0 CPU 1 (1) alloc xxxxxx (2) alloc xxxxxx (3) free xxxxxx (4) free xxxxxx This makes it look like CPU 1 somehow managed to allocate memory that CPU 0 still had allocated for itself. In order to avoid this, emit the "free xxxxxx" tracing report just before the actual call to free the memory, instead of just after it. Link: https://lkml.kernel.org/r/374eb75d-7404-8721-4e1e-65b0e5b17279@huawei.com Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
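
The ordering rule in this message can be sketched in user space as follows. The event log and the `toy_` names are hypothetical and stand in for the tracing machinery; the point is only that the "free" report is recorded before the memory becomes reusable by anyone else.

```c
#include <stddef.h>
#include <stdlib.h>

enum toy_event { TOY_TRACE_FREE, TOY_MEMORY_RELEASED };

struct toy_log {
	enum toy_event events[8];
	size_t count;
};

static void toy_log_add(struct toy_log *log, enum toy_event ev)
{
	if (log->count < sizeof(log->events) / sizeof(log->events[0]))
		log->events[log->count++] = ev;
}

/* Emit the "free" report first, then actually release the memory. */
static void toy_free_traced(struct toy_log *log, void *object)
{
	toy_log_add(log, TOY_TRACE_FREE);
	free(object);
	toy_log_add(log, TOY_MEMORY_RELEASED);
}
```

With the report emitted first, no other CPU can log an allocation of the same address ahead of this free, because the address is not yet available for reuse.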
|
#
53944f17 |
|
05-Nov-2021 |
Stephen Kitt <steve@sk2.org> |
mm: remove HARDENED_USERCOPY_FALLBACK This has served its purpose and is no longer used. All usercopy violations appear to have been handled by now, any remaining instances (or new bugs) will cause copies to be rejected. This isn't a direct revert of commit 2d891fbc3bb6 ("usercopy: Allow strict enforcement of whitelists"); since usercopy_fallback is effectively 0, the fallback handling is removed too. This also removes the usercopy_fallback module parameter on slab_common. Link: https://github.com/KSPP/linux/issues/153 Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org Signed-off-by: Stephen Kitt <steve@sk2.org> Suggested-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Joel Stanley <joel@jms.id.au> [defconfig change] Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: James Morris <jmorris@namei.org> Cc: "Serge E . Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
04b4b006 |
|
05-Nov-2021 |
Hyeonggon Yoo <42.hyeyoo@gmail.com> |
mm, slub: use prefetchw instead of prefetch Commit 0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()") introduced prefetch_freepointer() because when other cpu(s) freed objects into a page that current cpu owns, the freelist link is hot on cpu(s) which freed objects and possibly very cold on current cpu. But if freelist link chain is hot on cpu(s) which freed objects, it's better to invalidate that chain because they're not going to access again within a short time. So use prefetchw instead of prefetch. On supported architectures like x86 and arm, it invalidates other copied instances of a cache line when prefetching it. Before: Time: 91.677 Performance counter stats for 'hackbench -g 100 -l 10000': 1462938.07 msec cpu-clock # 15.908 CPUs utilized 18072550 context-switches # 12.354 K/sec 1018814 cpu-migrations # 696.416 /sec 104558 page-faults # 71.471 /sec 1580035699271 cycles # 1.080 GHz (54.51%) 2003670016013 instructions # 1.27 insn per cycle (54.31%) 5702204863 branch-misses (54.28%) 643368500985 cache-references # 439.778 M/sec (54.26%) 18475582235 cache-misses # 2.872 % of all cache refs (54.28%) 642206796636 L1-dcache-loads # 438.984 M/sec (46.87%) 18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%) 653842996501 dTLB-loads # 446.938 M/sec (46.63%) 3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%) 537531951350 iTLB-loads # 367.433 M/sec (54.33%) 114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%) 630135543177 L1-icache-loads # 430.733 M/sec (46.80%) 22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%) 91.964452802 seconds time elapsed 43.416742000 seconds user 1422.441123000 seconds sys After: Time: 90.220 Performance counter stats for 'hackbench -g 100 -l 10000': 1437418.48 msec cpu-clock # 15.880 CPUs utilized 17694068 context-switches # 12.310 K/sec 958257 cpu-migrations # 666.651 /sec 100604 page-faults # 69.989 /sec 1583259429428 cycles # 1.101 
GHz (54.57%) 2004002484935 instructions # 1.27 insn per cycle (54.37%) 5594202389 branch-misses (54.36%) 643113574524 cache-references # 447.409 M/sec (54.39%) 18233791870 cache-misses # 2.835 % of all cache refs (54.37%) 640205852062 L1-dcache-loads # 445.386 M/sec (46.75%) 17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%) 651747432274 dTLB-loads # 453.415 M/sec (46.59%) 3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%) 535395273064 iTLB-loads # 372.470 M/sec (54.38%) 113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%) 628871845924 L1-icache-loads # 437.501 M/sec (46.80%) 22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%) 90.514819303 seconds time elapsed 43.877656000 seconds user 1397.176001000 seconds sys Link: https://lkml.org/lkml/2021/10/8/598 Link: https://lkml.kernel.org/r/20211011144331.70084-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
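
A rough user-space sketch of prefetching the next freelist link for write, as this message describes. The `toy_cache` structure and `toy_freelist_pop()` are hypothetical; `__builtin_prefetch(addr, 1)` is the GCC/Clang builtin whose second argument requests a prefetch in anticipation of a write, used here in place of the kernel's `prefetchw()`.

```c
#include <stddef.h>
#include <string.h>

/* Toy cache: each free object stores a pointer to the next one at 'offset'. */
struct toy_cache {
	void *freelist;
	size_t offset;
};

/* Read the link stored inside a free object. */
static void *toy_get_freepointer(const struct toy_cache *c, void *object)
{
	void *next;

	memcpy(&next, (char *)object + c->offset, sizeof(next));
	return next;
}

/*
 * Pop the first free object. The link inside the following object is
 * prefetched for write (second argument 1), since the next pop will read it
 * and the object will then be handed out and written to.
 */
static void *toy_freelist_pop(struct toy_cache *c)
{
	void *object = c->freelist;
	void *next;

	if (!object)
		return NULL;

	next = toy_get_freepointer(c, object);
	if (next)
		__builtin_prefetch((char *)next + c->offset, 1);

	c->freelist = next;
	return object;
}
```

Whether the write hint helps depends on the architecture and workload, as the benchmark figures in the message suggest; the sketch makes no performance claim of its own.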
|
#
23e98ad1 |
|
05-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm/slub: increase default cpu partial list sizes The defaults are determined based on object size and can go up to 30 for objects smaller than 256 bytes. Before the previous patch changed the accounting, this could have made cpu partial list contain up to 30 pages. After that patch, only up to 2 pages with default allocation order. Very short lists limit the usefulness of the whole concept of cpu partial lists, so this patch aims at a more reasonable default under the new accounting. The defaults are quadrupled, except for object size >= PAGE_SIZE where it's doubled. This makes the lists grow up to 10 pages in practice. A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus shows the following slab memory usage after boot: Before previous patch (using page->pobjects): Slab: 36732 kB SReclaimable: 14836 kB SUnreclaim: 21896 kB After previous patch (using page->pages): Slab: 34720 kB SReclaimable: 13716 kB SUnreclaim: 21004 kB After this patch (using page->pages, higher defaults): Slab: 35252 kB SReclaimable: 13944 kB SUnreclaim: 21308 kB In the same setup, I also ran 5 times: hackbench -l 16000 -g 16 Differences in time were in the noise, we can compare slub stats as given by slabinfo -r skbuff_head_cache (the other cache heavily used by hackbench, kmalloc-cg-512 looks similar). Negligible stats left out for brevity. 
Before previous patch (using page->pobjects): Objects: 1408, Memory Total: 401408 Used : 304128 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 469952498 5946606 91 1 Slowpath 42053573 506059465 8 98 Page Alloc 41093 41044 0 0 Add partial 18 21229327 0 4 Remove partial 20039522 36051 3 0 Cpu partial list 4686640 24767229 0 4 RemoteObj/SlabFrozen 16 124027841 0 24 Total 512006071 512006071 Flushes 18 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 4993 0% Deactivation bypass 24767229 99% Refilled from foreign frees 21972674 88% After previous patch (using page->pages): Objects: 480, Memory Total: 131072 Used : 103680 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473016294 5405653 92 1 Slowpath 38989777 506600418 7 98 Page Alloc 32717 32701 0 0 Add partial 3 22749164 0 4 Remove partial 11371127 32474 2 0 Cpu partial list 11686226 23090059 2 4 RemoteObj/SlabFrozen 2 67541803 0 13 Total 512006071 512006071 Flushes 3 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 227 0% Deactivation bypass 23090059 99% Refilled from foreign frees 27585695 119% After this patch (using page->pages, higher defaults): Objects: 896, Memory Total: 229376 Used : 193536 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473799295 4980278 92 0 Slowpath 38206776 507025793 7 99 Page Alloc 32295 32267 0 0 Add partial 11 23291143 0 4 Remove partial 5815764 31278 1 0 Cpu partial list 18119280 23967320 3 4 RemoteObj/SlabFrozen 10 76974794 0 15 Total 512006071 512006071 Flushes 11 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 989 0% Deactivation bypass 23967320 99% Refilled from foreign frees 32358473 135% As expected, memory usage dropped significantly with change of accounting, increasing the defaults increased it, but 
not as much. The number of page allocation/frees dropped significantly with the new accounting, but didn't increase with the higher defaults. Interestingly, the number of fastpath allocations increased, as well as allocations from the cpu partial list, even though it's shorter. Link: https://lkml.kernel.org/r/20211012134651.11258-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b47291ef |
|
05-Nov-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: change percpu partial accounting from objects to pages With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial slabs that can be promoted to cpu slab when the previous one is depleted, without accessing the shared partial list. A slab can be added to this list by 1) refill of an empty list from get_partial_node() - once we really have to access the shared partial list, we acquire multiple slabs to amortize the cost of locking, and 2) first free to a previously full slab - instead of putting the slab on a shared partial list, we can more cheaply freeze it and put it on the per-cpu list. To control how large a percpu partial list can grow for a kmem cache, set_cpu_partial() calculates a target number of free objects on each cpu's percpu partial list, and this can be also set by the sysfs file cpu_partial. However, the tracking of actual number of objects is imprecise, in order to limit overhead from cpu X freeing an object to a slab on percpu partial list of cpu Y. Basically, the percpu partial slabs form a single linked list, and when we add a new slab to the list with current head "oldpage", we set in the struct page of the slab we're adding: page->pages = oldpage->pages + 1; // this is precise page->pobjects = oldpage->pobjects + (page->objects - page->inuse); page->next = oldpage; Thus the real number of free objects in the slab (objects - inuse) is only determined at the moment of adding the slab to the percpu partial list, and further freeing doesn't update the pobjects counter nor propagate it to the current list head. As Jann reports [1], this can easily lead to large inaccuracies, where the target number of objects (up to 30 by default) can translate to the same number of (empty) slab pages on the list. In case 2) above, we put a slab with 1 free object on the list, thus only increase page->pobjects by 1, even if there are subsequent frees on the same slab.
Jann has noticed this in practice and so did we [2] when investigating significant increase of kmemcg usage after switching from SLAB to SLUB. While this is no longer a problem in kmemcg context thanks to the accounting rewrite in 5.9, the memory waste is still not ideal and it's questionable whether it makes sense to perform free object count based control when object counts can easily become so inaccurate. So this patch converts the accounting to be based on number of pages only (which is precise) and removes the page->pobjects field completely. This is also ultimately simpler. To retain the existing set_cpu_partial() heuristic, first calculate the target number of objects as previously, but then convert it to target number of pages by assuming the pages will be half-filled on average. This assumption might obviously also be inaccurate in practice, but cannot degrade to actual number of pages being equal to the target number of objects. We could also skip the intermediate step with target number of objects and rewrite the heuristic in terms of pages. However we still have the sysfs file cpu_partial which uses number of objects and could break existing users if it suddenly becomes number of pages, so this patch doesn't do that. In practice, after this patch the heuristics limit the size of percpu partial list up to 2 pages. In case of a reported regression (which would mean some workload has benefited from the previous imprecise object based counting), we can tune the heuristics to get a better compromise within the new scheme, while still avoiding the unexpectedly long percpu partial lists. [1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/ [2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/ ========== Evaluation ========== Mel was kind enough to run v1 through mmtests machinery for netperf (localhost) and hackbench and, for most significant results see below.
So there are some apparent regressions, especially with hackbench, which I think ultimately boils down to having shorter percpu partial lists on average and some benchmarks benefiting from longer ones. Monitoring slab usage also indicated less memory usage by slab. Based on that, the following patch will bump the defaults to allow longer percpu partial lists than after this patch. However the goal is certainly not such that we would limit the percpu partial lists to 30 pages just because previously a specific alloc/free pattern could lead to the limit of 30 objects translate to a limit to 30 pages - that would make little sense. This is a correctness patch, and if a workload benefits from larger lists, the sysfs tuning knobs are still there to allow that. Netperf 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM TCP-RR: hmean before 127045.79 after 121092.94 (-4.69%, worse) stddev before 2634.37 after 1254.08 UDP-RR: hmean before 166985.45 after 160668.94 ( -3.78%, worse) stddev before 4059.69 after 1943.63 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM TCP-RR: hmean before 84173.25 after 76914.72 ( -8.62%, worse) UDP-RR: hmean before 93571.12 after 96428.69 ( 3.05%, better) stddev before 23118.54 after 16828.14 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM TCP-RR: hmean before 49984.92 after 48922.27 ( -2.13%, worse) stddev before 6248.15 after 4740.51 UDP-RR: hmean before 61854.31 after 68761.81 ( 11.17%, better) stddev before 4093.54 after 5898.91 other machines - within 2% Hackbench (results before and after the patch, negative % means worse) 2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM hackbench-process-sockets Amean 1 0.5380 0.5583 ( -3.78%) Amean 4 0.7510 0.8150 ( -8.52%) Amean 7 0.7930 0.9533 ( -20.22%) Amean 12 0.7853 1.1313 ( -44.06%) Amean 21 1.1520 1.4993 ( -30.15%) Amean 30 1.6223 1.9237 ( 
-18.57%)
Amean  48   2.6767   2.9903 ( -11.72%)
Amean  79   4.0257   5.1150 ( -27.06%)
Amean 110   5.5193   7.4720 ( -35.38%)
Amean 141   7.2207   9.9840 ( -38.27%)
Amean 172   8.4770  12.1963 ( -43.88%)
Amean 203   9.6473  14.3137 ( -48.37%)
Amean 234  11.3960  18.7917 ( -64.90%)
Amean 265  13.9627  22.4607 ( -60.86%)
Amean 296  14.9163  26.0483 ( -74.63%)

hackbench-thread-sockets
Amean   1   0.5597   0.5877 (  -5.00%)
Amean   4   0.7913   0.8960 ( -13.23%)
Amean   7   0.8190   1.0017 ( -22.30%)
Amean  12   0.9560   1.1727 ( -22.66%)
Amean  21   1.7587   1.5660 (  10.96%)
Amean  30   2.4477   1.9807 (  19.08%)
Amean  48   3.4573   3.0630 (  11.41%)
Amean  79   4.7903   5.1733 (  -8.00%)
Amean 110   6.1370   7.4220 ( -20.94%)
Amean 141   7.5777   9.2617 ( -22.22%)
Amean 172   9.2280  11.0907 ( -20.18%)
Amean 203  10.2793  13.3470 ( -29.84%)
Amean 234  11.2410  17.1070 ( -52.18%)
Amean 265  12.5970  23.3323 ( -85.22%)
Amean 296  17.1540  24.2857 ( -41.57%)

2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM

hackbench-process-sockets
Amean   1   0.5760   0.4793 (  16.78%)
Amean   4   0.9430   0.9707 (  -2.93%)
Amean   7   1.5517   1.8843 ( -21.44%)
Amean  12   2.4903   2.7267 (  -9.49%)
Amean  21   3.9560   4.2877 (  -8.38%)
Amean  30   5.4613   5.8343 (  -6.83%)
Amean  48   8.5337   9.2937 (  -8.91%)
Amean  79  14.0670  15.2630 (  -8.50%)
Amean 110  19.2253  21.2467 ( -10.51%)
Amean 141  23.7557  25.8550 (  -8.84%)
Amean 172  28.4407  29.7603 (  -4.64%)
Amean 203  33.3407  33.9927 (  -1.96%)
Amean 234  38.3633  39.1150 (  -1.96%)
Amean 265  43.4420  43.8470 (  -0.93%)
Amean 296  48.3680  48.9300 (  -1.16%)

hackbench-thread-sockets
Amean   1   0.6080   0.6493 (  -6.80%)
Amean   4   1.0000   1.0513 (  -5.13%)
Amean   7   1.6607   2.0260 ( -22.00%)
Amean  12   2.7637   2.9273 (  -5.92%)
Amean  21   5.0613   4.5153 (  10.79%)
Amean  30   6.3340   6.1140 (   3.47%)
Amean  48   9.0567   9.5577 (  -5.53%)
Amean  79  14.5657  15.7983 (  -8.46%)
Amean 110  19.6213  21.6333 ( -10.25%)
Amean 141  24.1563  26.2697 (  -8.75%)
Amean 172  28.9687  30.2187 (  -4.32%)
Amean 203  33.9763  34.6970 (  -2.12%)
Amean 234  38.8647  39.3207 (  -1.17%)
Amean 265  44.0813  44.1507 (  -0.16%)
Amean 296  49.2040  49.4330 (  -0.47%)

2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM

hackbench-process-sockets
Amean   1   0.5027   0.5017 (   0.20%)
Amean   4   1.1053   1.2033 (  -8.87%)
Amean   7   1.8760   2.1820 ( -16.31%)
Amean  12   2.9053   3.1810 (  -9.49%)
Amean  21   4.6777   4.9920 (  -6.72%)
Amean  30   6.5180   6.7827 (  -4.06%)
Amean  48  10.0710  10.5227 (  -4.48%)
Amean  79  16.4250  17.5053 (  -6.58%)
Amean 110  22.6203  24.4617 (  -8.14%)
Amean 141  28.0967  31.0363 ( -10.46%)
Amean 172  34.4030  36.9233 (  -7.33%)
Amean 203  40.5933  43.0850 (  -6.14%)
Amean 234  46.6477  48.7220 (  -4.45%)
Amean 265  53.0530  53.9597 (  -1.71%)
Amean 296  59.2760  59.9213 (  -1.09%)

hackbench-thread-sockets
Amean   1   0.5363   0.5330 (   0.62%)
Amean   4   1.1647   1.2157 (  -4.38%)
Amean   7   1.9237   2.2833 ( -18.70%)
Amean  12   2.9943   3.3110 ( -10.58%)
Amean  21   4.9987   5.1880 (  -3.79%)
Amean  30   6.7583   7.0043 (  -3.64%)
Amean  48  10.4547  10.8353 (  -3.64%)
Amean  79  16.6707  17.6790 (  -6.05%)
Amean 110  22.8207  24.4403 (  -7.10%)
Amean 141  28.7090  31.0533 (  -8.17%)
Amean 172  34.9387  36.8260 (  -5.40%)
Amean 203  41.1567  43.0450 (  -4.59%)
Amean 234  47.3790  48.5307 (  -2.43%)
Amean 265  53.9543  54.6987 (  -1.38%)
Amean 296  60.0820  60.2163 (  -0.22%)

1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads), 32 GB RAM

hackbench-process-sockets
Amean   1   1.4760   1.5773 (  -6.87%)
Amean   3   3.9370   4.0910 (  -3.91%)
Amean   5   6.6797   6.9357 (  -3.83%)
Amean   7   9.3367   9.7150 (  -4.05%)
Amean  12  15.7627  16.1400 (  -2.39%)
Amean  18  23.5360  23.6890 (  -0.65%)
Amean  24  31.0663  31.3137 (  -0.80%)
Amean  30  38.7283  39.0037 (  -0.71%)
Amean  32  41.3417  41.6097 (  -0.65%)

hackbench-thread-sockets
Amean   1   1.5250   1.6043 (  -5.20%)
Amean   3   4.0897   4.2603 (  -4.17%)
Amean   5   6.7760   7.0933 (  -4.68%)
Amean   7   9.4817   9.9157 (  -4.58%)
Amean  12  15.9610  16.3937 (  -2.71%)
Amean  18  23.9543  24.3417 (  -1.62%)
Amean  24  31.4400  31.7217 (  -0.90%)
Amean  30  39.2457  39.5467 (  -0.77%)
Amean  32  41.8267  42.1230 (  -0.71%)

2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM

hackbench-process-sockets
Amean   1   1.0347   1.0880 (  -5.15%)
Amean   4   1.7267   1.8527 (  -7.30%)
Amean   7   2.6707   2.8110 (  -5.25%)
Amean  12   4.1617   4.3383 (  -4.25%)
Amean  21   7.0070   7.2600 (  -3.61%)
Amean  30   9.9187  10.2397 (  -3.24%)
Amean  48  15.6710  16.3923 (  -4.60%)
Amean  79  24.7743  26.1247 (  -5.45%)
Amean 110  34.3000  35.9307 (  -4.75%)
Amean 141  44.2043  44.8010 (  -1.35%)
Amean 172  54.2430  54.7260 (  -0.89%)
Amean 192  60.6557  60.9777 (  -0.53%)

hackbench-thread-sockets
Amean   1   1.0610   1.1353 (  -7.01%)
Amean   4   1.7543   1.9140 (  -9.10%)
Amean   7   2.7840   2.9573 (  -6.23%)
Amean  12   4.3813   4.4937 (  -2.56%)
Amean  21   7.3460   7.5350 (  -2.57%)
Amean  30  10.2313  10.5190 (  -2.81%)
Amean  48  15.9700  16.5940 (  -3.91%)
Amean  79  25.3973  26.6637 (  -4.99%)
Amean 110  35.1087  36.4797 (  -3.91%)
Amean 141  45.8220  46.3053 (  -1.05%)
Amean 172  55.4917  55.7320 (  -0.43%)
Amean 192  62.7490  62.5410 (   0.33%)

Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Jann Horn <jannh@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0fe47c6 |
|
05-Nov-2021 |
Kefeng Wang <wangkefeng.wang@huawei.com> |
slub: add back check for free nonslab objects After commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free"), the check for freeing a nonslab page was replaced by VM_BUG_ON_PAGE, which only checks when CONFIG_DEBUG_VM is enabled. That config may impact performance, so it is meant only for debugging. Commit 0937502af7c9 ("slub: Add check for kfree() of non slab objects.") added this check, which is needed in any config to catch invalid frees; these can indicate serious issues such as memory corruption, use-after-free and double free. So replace VM_BUG_ON_PAGE with WARN_ON_ONCE, and print the object address to help debug the issue. Link: https://lkml.kernel.org/r/20210930070214.61499-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
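The difference between a debug-only assertion and an always-on, warn-once check can be illustrated in userspace. The following is a minimal sketch, not the kernel code: `warn_on_once()`, `check_nonslab_free()`, `warnings_printed` and the `is_compound_page` parameter are hypothetical stand-ins for `WARN_ON_ONCE()`, the nonslab free path and `PageCompound()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually emitted (demo bookkeeping). */
static unsigned int warnings_printed;

/*
 * Hypothetical userspace imitation of a warn-once check: it always
 * evaluates the condition, but prints a message only the first time
 * the condition is true. Returns the condition.
 */
static bool warn_on_once(bool cond, const void *object)
{
    static bool warned;

    if (cond && !warned) {
        warned = true;
        warnings_printed++;
        fprintf(stderr, "WARNING: invalid free, object pointer: %p\n",
                (void *)object);
    }
    return cond;
}

/*
 * Hypothetical check for freeing an object that does not belong to a
 * slab. Returns true when the free looks valid, false otherwise.
 */
static bool check_nonslab_free(const void *object, bool is_compound_page)
{
    assert(object != NULL);
    return !warn_on_once(!is_compound_page, object);
}
```

Unlike an assertion that is compiled out without a debug option, the check runs in every build, and the message is rate-limited to a single occurrence so a repeated bad free does not flood the log.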
|
#
23efd080 |
|
19-Oct-2021 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
vsprintf: Make %pGp print the hex value All existing users of %pGp want the hex value as well as the decoded flag names. This looks awkward (passing the same parameter to printf twice), so move that functionality into the core. If we want, we can make that optional with flag arguments to %pGp in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org
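To illustrate the kind of output described above (a hex value followed by the decoded flag names), here is a rough userspace sketch. It is not the kernel's vsprintf implementation: `format_flags()`, the `demo_flags` table and its bit assignments are assumptions made up for this example, and the real page flag names and positions depend on the kernel configuration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag table; names and bit positions are illustrative. */
struct flag_name {
    unsigned long mask;
    const char *name;
};

static const struct flag_name demo_flags[] = {
    { 1UL << 0, "locked" },
    { 1UL << 1, "referenced" },
    { 1UL << 2, "uptodate" },
    { 1UL << 3, "dirty" },
};

/*
 * Write "0x<hex>(name|name|...)" into buf, truncating if buf is too
 * small. Returns buf so the call can be used inline.
 */
static char *format_flags(char *buf, size_t len, unsigned long flags)
{
    size_t i, used;
    int first = 1;

    assert(buf != NULL && len > 0);

    /* The raw value comes first, so callers need not pass it twice. */
    snprintf(buf, len, "0x%lx(", flags);

    for (i = 0; i < sizeof(demo_flags) / sizeof(demo_flags[0]); i++) {
        if (!(flags & demo_flags[i].mask))
            continue;
        used = strlen(buf);
        snprintf(buf + used, len - used, "%s%s",
                 first ? "" : "|", demo_flags[i].name);
        first = 0;
    }

    used = strlen(buf);
    snprintf(buf + used, len - used, ")");
    return buf;
}
```

With this table, a value of 0x5 would be rendered as the hex number followed by the two matching names in parentheses, which mirrors the idea of folding the duplicated printf argument into the format specifier itself.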
|
#
3ddd6026 |
|
18-Oct-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm, slub: fix incorrect memcg slab count for bulk free kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects when doing bulk free. So we shouldn't call memcg_slab_free_hook() again for bulk free to avoid incorrect memcg slab count. Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com Fixes: d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
67823a54 |
|
18-Oct-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm, slub: fix potential use-after-free in slab_debugfs_fops When sysfs_slab_add failed, we shouldn't call debugfs_slab_add() for s because s will be freed soon. And slab_debugfs_fops will use s later leading to a use-after-free. Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9037c576 |
|
18-Oct-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm, slub: fix potential memoryleak in kmem_cache_open() In error path, the random_seq of slub cache might be leaked. Fix this by using __kmem_cache_release() to release all the relevant resources. Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
899447f6 |
|
18-Oct-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm, slub: fix mismatch between reconstructed freelist depth and cnt If an object's reuse is delayed, it will be excluded from the reconstructed freelist. But we forgot to adjust the cnt accordingly. So there will be a mismatch between reconstructed freelist depth and cnt. This will lead to free_debug_processing() complaining about freelist count or an incorrect slub inuse count. Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com Fixes: c3895391df38 ("kasan, slub: fix handling of kasan_slab_free hook") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
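The bookkeeping problem described here can be shown with a small userspace model. This is a sketch under stated assumptions, not the kernel function: `struct object`, its `delayed` field and `rebuild_freelist()` are hypothetical, with `delayed` standing in for "the free hook asked to delay reuse of this object".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object header used only for this illustration. */
struct object {
    struct object *next;
    bool delayed;   /* stand-in for "reuse of this object is delayed" */
};

/*
 * Rebuild the list at *head, leaving out delayed objects.
 *
 * *cnt is the length of the original list on entry. It is decremented
 * for every object that is left out, so that on return it matches the
 * depth of the rebuilt list. Forgetting that decrement is the kind of
 * mismatch the commit message describes.
 *
 * The rebuilt list ends up in reverse order, which is fine for a
 * freelist. Returns the depth of the rebuilt list.
 */
static size_t rebuild_freelist(struct object **head, size_t *cnt)
{
    struct object *obj, *next;
    struct object *new_head = NULL;
    size_t depth = 0;

    assert(head != NULL && cnt != NULL);

    for (obj = *head; obj; obj = next) {
        next = obj->next;
        if (obj->delayed) {
            (*cnt)--;           /* keep cnt in sync with the new list */
            continue;
        }
        obj->next = new_head;   /* push onto the rebuilt list */
        new_head = obj;
        depth++;
    }

    *head = new_head;
    return depth;
}
```

After the call, the returned depth and `*cnt` should agree; any later consistency check that compares the two would otherwise report a corrupted freelist even though the list itself is intact.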
|
#
2127d225 |
|
18-Oct-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm, slub: fix two bugs in slab_debug_trace_open() Patch series "Fixups for slub". This series contains various bug fixes for slub. We fix memory leaks, use-after-free, NULL pointer dereferencing and so on in slub. More details can be found in the respective changelogs. This patch (of 5): It's possible that __seq_open_private() will return NULL. So we should check it before use to avoid dereferencing a NULL pointer. And in error paths, we forgot to release the private buffer via seq_release_private(). Memory will leak in these paths. Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bd0e7491 |
|
21-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: convert kmem_cpu_slab protection to local_lock Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's equivalent, with better lockdep visibility. On PREEMPT_RT that means better preemption. However, the cost on PREEMPT_RT is the loss of lockless fast paths which only work with cpu freelist. Those are designed to detect and recover from being preempted by other conflicting operations (both fast or slow path), but the slow path operations assume they cannot be preempted by a fast path operation, which is guaranteed naturally with disabled irqs. With local locks on PREEMPT_RT, the fast paths now also need to take the local lock to avoid races. In the allocation fastpath slab_alloc_node() we can just defer to the slowpath __slab_alloc() which also works with cpu freelist, but under the local lock. In the free fastpath do_slab_free() we have to add a new local lock protected version of freeing to the cpu freelist, as the existing slowpath only works with the page freelist. Also update the comment about locking scheme in SLUB to reflect changes done by this series. [ Mike Galbraith <efault@gmx.de>: use local_lock() without irq in PREEMPT_RT scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ] Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
25c00c50 |
|
21-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: use migrate_disable() on PREEMPT_RT We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with the list_lock spinlock. We can use migrate_disable() instead, but that increases overhead on !PREEMPT_RT as it's an unconditional function call. In order to get the best available mechanism on both PREEMPT_RT and !PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr() wrappers and use them. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
e0a043aa |
|
27-Jul-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg Jann Horn reported [1] the following theoretically possible race: task A: put_cpu_partial() calls preempt_disable() task A: oldpage = this_cpu_read(s->cpu_slab->partial) interrupt: kfree() reaches unfreeze_partials() and discards the page task B (on another CPU): reallocates page as page cache task A: reads page->pages and page->pobjects, which are actually halves of the pointer page->lru.prev task B (on another CPU): frees page interrupt: allocates page as SLUB page and places it on the percpu partial list task A: this_cpu_cmpxchg() succeeds which would cause page->pages and page->pobjects to end up containing halves of pointers that would then influence when put_cpu_partial() happens and show up in root-only sysfs files. Maybe that's acceptable, I don't know. But there should probably at least be a comment for now to point out that we're reading union fields of a page that might be in a completely different state. Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the latter disables irqs, otherwise a __slab_free() in an irq handler could call put_cpu_partial() in the middle of ___slab_alloc() manipulating ->partial and corrupt it. This becomes an issue on RT after a local_lock is introduced in later patch. The fix means taking the local_lock also in put_cpu_partial() on RT. After debugging this issue, Mike Galbraith suggested [2] that to avoid different locking schemes on RT and !RT, we can just protect put_cpu_partial() with disabled irqs (to be converted to local_lock_irqsave() later) everywhere. This should be acceptable as it's not a fast path, and moving the actual partial unfreezing outside of the irq disabled section makes it short, and with the retry loop gone the code can be also simplified. In addition, the race reported by Jann should no longer be possible. 
[1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/ [2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/ Reported-by: Jann Horn <jannh@google.com> Suggested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
a2b4ae8b |
|
03-Jun-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: make slab_lock() disable irqs with PREEMPT_RT We need to disable irqs around slab_lock() (a bit spinlock) to make it irq-safe. Most calls to slab_lock() are nested under spin_lock_irqsave() which doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT. The exception is cmpxchg_double_slab() which already disables irqs, so use a __slab_[un]lock() variant without irq disable there. slab_[un]lock() thus needs a flags pointer parameter, which is unused on !RT. free_debug_processing() now has two flags variables, which looks odd, but only one is actually used - the one used in spin_lock_irqsave() on !RT and the one used in slab_lock() on RT. As a result, __cmpxchg_double_slab() and cmpxchg_double_slab() become effectively identical on RT, as both will disable irqs, which is necessary on RT as most callers of this function also rely on irqsaving lock operations. Thus, assert that irqs are already disabled in __cmpxchg_double_slab() only on !RT and also change the VM_BUG_ON assertion to the more standard lockdep_assert one. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
94ef0304 |
|
16-Jul-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
mm: slub: make object_map_lock a raw_spinlock_t The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within an already atomic context. Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
5a836bf6 |
|
26-Feb-2021 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context flush_all() flushes a specific SLAB cache on each CPU (where the cache is present). The deactivate_slab()/__free_slab() invocation happens within IPI handler and is problematic for PREEMPT_RT. The flush operation is not a frequent operation or a hot path. The per-CPU flush operation can be moved to within a workqueue. Because a workqueue handler, unlike IPI handler, does not disable irqs, flush_slab() now has to disable them for working with the kmem_cache_cpu fields. deactivate_slab() is safe to call with irqs enabled. [vbabka@suse.cz: adapt to new SLUB changes] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
08beb547 |
|
03-Jun-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab: split out the cpu offline variant of flush_slab() flush_slab() is called either as part of the IPI handler on a given live cpu, or as a cleanup on behalf of another cpu that went offline. The first case needs to protect updating the kmem_cache_cpu fields with disabled irqs. Currently the whole call happens with irqs disabled by the IPI handler, but the following patch will change from IPI to workqueue, and flush_slab() will have to disable irqs (to be replaced with a local lock later) in the critical part. To prepare for this change, replace the call to flush_slab() for the dead cpu handling with an opencoded variant that will not disable irqs nor take a local lock. Suggested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
0e7ac738 |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: don't disable irqs in slub_cpu_dead() slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only functions that are now irq safe, so we don't need to disable irqs anymore. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
7cf9f3ba |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: only disable irq with spin_lock in __unfreeze_partials() __unfreeze_partials() no longer needs to have irqs disabled, except for making the spin_lock operations irq-safe, so convert the spin_locks operations and remove the separate irq handling. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
fc1455f4 |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing Unfreezing the partial list can be split into two phases - detaching the list from struct kmem_cache_cpu, and processing the list. The whole operation does not need to be protected by disabled irqs. Restructure the code to separate the detaching (with disabled irqs) and unfreezing (with irq disabling to be reduced in the next patch). Also, unfreeze_partials() can be called from another cpu on behalf of a cpu that is being offlined, where disabling irqs on the local cpu makes no sense, so restructure the code as follows:
- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call unfreeze_partials(). So it can't simply call __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the proper calls.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
c2f973ba |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: detach whole partial list at once in unfreeze_partials() Instead of iterating through the live percpu partial list, detach it from the kmem_cache_cpu at once. This is simpler and will allow further optimization. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
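The "detach the whole list, then process it" idea can be sketched in userspace. Note the substitution: the kernel code relies on a short irq-disabled section to detach the list, whereas this sketch uses a C11 `atomic_exchange()` as a stand-in for that short critical section. `struct node`, `struct percpu_cache`, `detach_partial()` and `process_detached()` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node standing in for a partial slab. */
struct node {
    struct node *next;
};

/* Hypothetical per-cpu structure holding the partial list head. */
struct percpu_cache {
    _Atomic(struct node *) partial;
};

/*
 * Phase 1: detach the entire list in one step. After this returns,
 * the per-cpu structure no longer references any of the nodes, so the
 * caller owns the list exclusively.
 */
static struct node *detach_partial(struct percpu_cache *c)
{
    assert(c != NULL);
    return atomic_exchange(&c->partial, NULL);
}

/*
 * Phase 2: walk the detached list at leisure. No protection is needed
 * because nobody else can reach these nodes anymore. Returns how many
 * nodes were processed.
 */
static size_t process_detached(struct node *list)
{
    size_t n = 0;

    while (list) {
        struct node *next = list->next;

        list->next = NULL;   /* placeholder for the real per-slab work */
        n++;
        list = next;
    }
    return n;
}
```

The design point is that only phase 1 touches shared state, so the protected section stays constant-time regardless of how long the list is.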
|
#
8de06a6f |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: discard slabs in unfreeze_partials() without irqs disabled No need for disabled irqs when discarding slabs, so restore them before discarding. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
f3ab8b6b |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: move irq control into unfreeze_partials() unfreeze_partials() can be optimized so that it doesn't need irqs disabled for the whole time. As the first step, move irq control into the function and remove it from the put_cpu_partial() caller. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
cfdf836e |
|
12-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: call deactivate_slab() without disabling irqs The function is now safe to be called with irqs enabled, so move the calls outside of irq disabled sections. When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to reenable them before deactivate_slab() we need to open-code flush_slab() in ___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields. But that means an IRQ handler meanwhile might have assigned a new page to kmem_cache_cpu.page so we have to retry the whole check. The remaining callers of flush_slab() are the IPI handler which has disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the following patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
3406e91b |
|
12-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: make locking in deactivate_slab() irq-safe deactivate_slab() now no longer touches the kmem_cache_cpu structure, so it will be possible to call it with irqs enabled. Just convert the spin_lock calls to their irq saving/restoring variants to make it irq-safe. Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because in some situations we don't take the list_lock, which would disable irqs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
a019d201 |
|
12-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: move reset of c->page and freelist out of deactivate_slab() deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's freelist and putting the slab on the proper node's list. It also sets the respective kmem_cache_cpu pointers to NULL. By extracting the kmem_cache_cpu operations from the function, we can make it not dependent on disabled irqs. Also if we return a single free pointer from ___slab_alloc, we no longer have to assign kmem_cache_cpu.page before deactivation or care if somebody preempted us and assigned a different page to our kmem_cache_cpu in the process. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
4b1f449d |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: stop disabling irqs around get_partial() The function get_partial() does not need to have irqs disabled as a whole. It's sufficient to convert spin_lock operations to their irq saving/restoring versions. As a result, it's now possible to reach the page allocator from the slab allocator without disabling and re-enabling interrupts on the way. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
9f101ee8 |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: check new pages with restored irqs Building on top of the previous patch, re-enable irqs before checking new pages. alloc_debug_processing() is now called with enabled irqs so we need to remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be a need for it anyway. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
3f2b77e3 |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: validate slab from partial list or page allocator before making it cpu slab When we obtain a new slab page from node partial list or page allocator, we assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo the assignment. In order to allow doing the checks without irq disabled, restructure the code so that the checks are done first, and kmem_cache_cpu.page assignment only after they pass. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
6c1dbb67 |
|
10-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: restore irqs around calling new_slab() allocate_slab() currently re-enables irqs before calling to the page allocator. It depends on gfpflags_allow_blocking() to determine if it's safe to do so. Now we can instead simply restore irq before calling it through new_slab(). The other caller early_kmem_cache_node_alloc() is unaffected by this. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
fa417ab7 |
|
10-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() Continue reducing the irq disabled scope. Check for per-cpu partial slabs first with irqs enabled and then recheck with irqs disabled before grabbing the slab page. Mostly preparatory for the following patches. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
0b303fb4 |
|
07-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: do initial checks in ___slab_alloc() with irqs enabled As another step of shortening irq disabled sections in ___slab_alloc(), delay disabling irqs until we pass the initial checks if there is a cached percpu slab and it's suitable for our allocation. Now we have to recheck c->page after actually disabling irqs as an allocation in irq handler might have replaced it. Because we call pfmemalloc_match() as one of the checks, we might hit VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe() variant that lacks the PageSlab check. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
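The "check with irqs enabled, then recheck after disabling them" pattern can be modelled in plain C. This is a simplified single-threaded simulation, not kernel code: `struct cpu_cache`, its `irqs_off` flag, `take_cached_page()` and the `demo_irq()` callback (which plays the role of an interrupt handler replacing the cached page) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-cpu state; not the kernel structure. */
struct cpu_cache {
    void *page;      /* models c->page */
    bool irqs_off;   /* models the local irq state */
};

/* Callback that plays the role of an interrupt handler. */
typedef void (*irq_fn)(struct cpu_cache *c);

/* Dummy object and handler used to simulate a page replacement. */
static int replacement_page;

static void demo_irq(struct cpu_cache *c)
{
    c->page = &replacement_page;
}

/*
 * Take the cached page, or return NULL if there is none. If irq is
 * non-NULL it fires once in the window between the initial check and
 * the point where "irqs" are disabled.
 */
static void *take_cached_page(struct cpu_cache *c, irq_fn irq)
{
    void *page;

    assert(c != NULL);

reread_page:
    page = c->page;            /* initial check, "irqs" still enabled */
    if (!page)
        return NULL;

    if (irq) {                 /* simulated interrupt in the window */
        irq_fn fire = irq;

        irq = NULL;            /* fire only once */
        fire(c);
    }

    c->irqs_off = true;        /* models disabling irqs */
    if (page != c->page) {     /* recheck: did a handler replace it? */
        c->irqs_off = false;
        goto reread_page;
    }

    c->page = NULL;            /* stable now, safe to take */
    c->irqs_off = false;       /* models restoring irqs */
    return page;
}
```

When the simulated handler swaps the page during the window, the recheck fails and the function retries with the new value instead of acting on a stale pointer.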
|
#
e500059b |
|
07-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: move disabling/enabling irqs to ___slab_alloc() Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This includes cases where this is not needed, such as when the allocation ends up in the page allocator and has to awkwardly enable irqs back based on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when it hits the __slab_alloc() slow path, and long periods with disabled interrupts are undesirable. As a first step towards reducing irq disabled periods, move irq handling into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not protect against modification by an irq handler, which is still done by disabled irq for most of ___slab_alloc(). As a small immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled. kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them before calling ___slab_alloc(), which then disables them at its discretion. The whole kmem_cache_alloc_bulk() operation also disables preemption. When ___slab_alloc() calls new_slab() to allocate a new page, re-enable preemption, because new_slab() will re-enable interrupts in contexts that allow blocking (this will be improved by later patches). The patch itself will thus increase overhead a bit due to disabled preemption (on configs where it matters) and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in the following patches. Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in all configurations. On configs without involuntary preemption and debugging the re-read of kmem_cache_cpu pointer is still compiled out as it was before. 
[ Mike Galbraith <efault@gmx.de>: Fix kmem_cache_alloc_bulk() error path ] Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
9b4bc85a |
|
17-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: simplify kmem_cache_cpu and tid setup In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently that's done by reading the tid first using this_cpu_read(), then the kmem_cache_cpu pointer and verifying we read the same tid using the pointer and plain READ_ONCE(). This can be simplified to just fetching kmem_cache_cpu pointer and then reading tid using the pointer. That guarantees they are from the same cpu. We don't need to read the tid using this_cpu_read() because the value will be validated by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the freelist didn't change by anyone preempting us since reading the tid. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
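The role of the tid as a validation tag for a compare-and-exchange can be sketched with C11 atomics. This is an analogy under explicit assumptions, not the SLUB fast path: instead of a per-cpu double-word cmpxchg on (freelist, tid), the sketch packs a 32-bit tid and a 32-bit free-object index into one 64-bit atomic word. `struct cpu_cache`, `pack()`, `alloc_fast()`, `NO_OBJECT` and `MAX_OBJECTS` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NO_OBJECT   UINT32_MAX
#define MAX_OBJECTS 8

/* Hypothetical cache: free objects are linked by index. */
struct cpu_cache {
    /* high 32 bits: tid, low 32 bits: index of the first free object */
    _Atomic uint64_t state;
    uint32_t next_free[MAX_OBJECTS];
};

static uint64_t pack(uint32_t tid, uint32_t head)
{
    return ((uint64_t)tid << 32) | head;
}

/*
 * Pop one object index from the freelist, or return NO_OBJECT if the
 * list is empty. The tid and the list head are read together as one
 * snapshot; the compare-exchange succeeds only if neither changed in
 * the meantime, and it bumps the tid so that stale snapshots held by
 * anyone else fail their own compare-exchange.
 */
static uint32_t alloc_fast(struct cpu_cache *c)
{
    uint64_t old = atomic_load(&c->state);

    for (;;) {
        uint32_t head = (uint32_t)old;
        uint32_t tid = (uint32_t)(old >> 32);
        uint64_t next;

        if (head == NO_OBJECT)
            return NO_OBJECT;
        assert(head < MAX_OBJECTS);

        next = pack(tid + 1, c->next_free[head]);
        if (atomic_compare_exchange_weak(&c->state, &old, next))
            return head;
        /* On failure, old now holds the current state; retry. */
    }
}
```

Because the tag and the list head come from the same snapshot, there is no need to read the tag separately and cross-check it; the compare-exchange is the only validation step, which is the simplification the commit message argues for.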
|
#
1572df7c |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: restructure new page checks in ___slab_alloc() When we allocate slab object from a newly acquired page (from node's partial list or page allocator), we usually also retain the page as a new percpu slab. There are two exceptions - when pfmemalloc status of the page doesn't match our gfp flags, or when the cache has debugging enabled. The current code for these decisions is not easy to follow, so restructure it and add comments. The new structure will also help with the following changes. No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
|
#
75c8ff28 |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: return slab page from get_partial() and set c->page afterwards The function get_partial() finds a suitable page on a partial list, acquires and returns its freelist and assigns the page pointer to kmem_cache_cpu. In a later patch we will need more control over the kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a pointer to a page that get_partial() can fill and the caller can assign the kmem_cache_cpu.page pointer. No functional change as all of this still happens with disabled IRQs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
53a0de06 |
|
11-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: dissolve new_slab_objects() into ___slab_alloc() The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it there. This is a preparatory step with no functional change. The only minor change is moving WARN_ON_ONCE() for using a constructor together with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still able to catch a development change introducing a systematic misuse. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
|
#
2a904905 |
|
10-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: extract get_partial() from new_slab_objects() The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first preparatory step with no functional change. This adds a goto label that appears unnecessary at this point, but will be useful for later changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
|
#
976b805c |
|
07-Jun-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial() Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately") introduced cpu partial flushing for kmemcg caches, based on setting the target cpu_partial to 0 and adding a flushing check in put_cpu_partial(). This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"). However the check and flush in put_cpu_partial() was never removed, although it's effectively a dead code. So this patch removes it. Note that d6e0b7fa1186 also added preempt_disable()/enable() to unfreeze_partials() which could be thus also considered unnecessary. But further patches will rely on it, so keep it. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
#
84048039 |
|
20-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: don't disable irq for debug_check_no_locks_freed() In slab_free_hook() we disable irqs around the debug_check_no_locks_freed() call, which is unnecessary, as irqs are already being disabled inside the call. This seems to be leftover from the past where there were more calls inside the irq disabled sections. Remove the irq disable/enable operations. Mel noted: > Looks like it was needed for kmemcheck which went away back in 4.15 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
|
#
0a19e7dd |
|
22-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: allocate private object map for validate_slab_cache() validate_slab_cache() is called either to handle a sysfs write, or from a self-test context. In both situations it's straightforward to preallocate a private object bitmap instead of grabbing the shared static one meant for critical sections, so let's do that. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
|
#
b3fd64e1 |
|
22-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: allocate private object map for debugfs listings Slub has a static spinlock protected bitmap for marking which objects are on freelist when it wants to list them, for situations where dynamically allocating such map can lead to recursion or locking issues, and on-stack bitmap would be too large. The handlers of debugfs files alloc_traces and free_traces also currently use this shared bitmap, but their syscall context makes it straightforward to allocate a private map before entering locked sections, so switch these processing paths to use a private bitmap. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
|
#
eafb1d64 |
|
28-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: don't call flush_all() from slab_debug_trace_open() slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER flag and as with all slub debugging flags, such caches avoid cpu or percpu partial slabs altogether, so there's nothing to flush. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
|
#
a7f1d485 |
|
13-Aug-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm: slub: fix slub_debug disabling for list of slabs Vijayanand Jitta reports: Consider the scenario where CONFIG_SLUB_DEBUG_ON is set and we would want to disable slub_debug for few slabs. Using boot parameter with slub_debug=-,slab_name syntax doesn't work as expected i.e; only disabling debugging for the specified list of slabs. Instead it disables debugging for all slabs, which is wrong. This patch fixes it by delaying the moment when the global slub_debug flags variable is updated. In case a "slub_debug=-,slab_name" has been passed, the global flags remain as initialized (depending on CONFIG_SLUB_DEBUG_ON enabled or disabled) and are not simply reset to 0. Link: https://lkml.kernel.org/r/8a3d992a-473a-467b-28a0-4ad2ff60ab82@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vijayanand Jitta <vjitta@codeaurora.org> Reviewed-by: Vijayanand Jitta <vjitta@codeaurora.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
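The intended behaviour can be sketched in plain user-space C. This is an illustration only, not the kernel's parser: `flags_for_cache()`, `name_in_list()` and the `DEBUG_DEFAULT` value are hypothetical names, and flag letters and glob patterns are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the build-time default debug flags. */
#define DEBUG_DEFAULT 0x1u

/* Exact-match lookup of name in a comma-separated list (no globbing). */
static bool name_in_list(const char *list, const char *name)
{
    size_t n = strlen(name);

    while (*list) {
        size_t len = strcspn(list, ",");

        if (len == n && strncmp(list, name, n) == 0)
            return true;
        list += len;
        if (*list == ',')
            list++;
    }
    return false;
}

/*
 * Debug flags a cache ends up with for a given "slub_debug=" value.
 * Only the "-" and "-,name[,name...]" forms are modelled.
 */
static unsigned int flags_for_cache(const char *arg, const char *cache)
{
    assert(arg != NULL && cache != NULL);

    if (strcmp(arg, "-") == 0)
        return 0;                       /* disable debugging everywhere */
    if (strncmp(arg, "-,", 2) == 0)     /* disable only the listed caches */
        return name_in_list(arg + 2, cache) ? 0 : DEBUG_DEFAULT;
    return DEBUG_DEFAULT;               /* other forms are not modelled */
}
```

With "-,dentry", only the cache named dentry loses its debug flags in this sketch; every other cache keeps the default, which is the behaviour the fix restores.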
|
#
1ed7ce57 |
|
13-Aug-2021 |
Shakeel Butt <shakeelb@google.com> |
slub: fix kmalloc_pagealloc_invalid_free unit test The unit test kmalloc_pagealloc_invalid_free makes sure that for the higher order slub allocation which goes to page allocator, the free is called with the correct address i.e. the virtual address of the head page. Commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free") unified the free code paths for page allocator based slub allocations but instead of using the address passed by the caller, it extracted the address from the page, thus making the unit test kmalloc_pagealloc_invalid_free moot. So, fix this by using the address passed by the caller. Should we fix this? I think yes, because developers expect KASAN to catch these types of programming bugs. Link: https://lkml.kernel.org/r/20210802180819.1110165-1-shakeelb@google.com Fixes: f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
340caf17 |
|
13-Aug-2021 |
Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> |
kasan, slub: reset tag when printing address The address still includes the tags when it is printed. With hardware tag-based kasan enabled, we will get a false positive KASAN issue when we access metadata. Reset the tag before we access the metadata. Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
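A minimal sketch of what resetting a tag means, in user-space C. It assumes the tag lives in the top byte of a 64-bit address and that 0xff is the untagged kernel value; `reset_tag()`, `get_tag()` and `metadata_offset()` are hypothetical helpers for illustration, not the kernel's KASAN API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the tag occupies bits 63..56 and 0xff means "untagged". */
#define TAG_SHIFT  56
#define TAG_KERNEL 0xffULL

/* Extract the tag byte from an address. */
static uint8_t get_tag(uint64_t addr)
{
    return (uint8_t)(addr >> TAG_SHIFT);
}

/* Replace whatever tag the address carries with the untagged value. */
static uint64_t reset_tag(uint64_t addr)
{
    return (addr & ~(0xffULL << TAG_SHIFT)) | (TAG_KERNEL << TAG_SHIFT);
}

/*
 * Stand-in for a metadata access: it only accepts untagged addresses,
 * mirroring the rule that the tag is reset before metadata is touched.
 */
static uint64_t metadata_offset(uint64_t addr, uint64_t base)
{
    assert(get_tag(addr) == TAG_KERNEL && get_tag(base) == TAG_KERNEL);
    return addr - base;
}
```

In this sketch a caller would pass `reset_tag(object_address)` to `metadata_offset()`; passing the tagged address directly trips the assertion, which plays the role of the false positive described above.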
|
#
f227f0fa |
|
29-Jul-2021 |
Shakeel Butt <shakeelb@google.com> |
slub: fix unreclaimable slab stat for bulk free SLUB uses the page allocator for higher order allocations and updates the unreclaimable slab stat for such allocations. At the moment, the bulk free for SLUB does not share code with the normal free code path for these types of allocations and has missed the stat update. So, fix the stat update by using common code. The user visible impact of the bug is the potential of inconsistent unreclaimable slab stat visible through meminfo and vmstat. Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ae14c63a |
|
17-Jul-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "mm/slub: use stackdepot to save stack trace in objects" This reverts commit 788691464c29455346dc613a3b43c2fb9e5757a4. It's not clear why, but it causes unexplained problems in entirely unrelated xfs code. The most likely explanation is some slab corruption, possibly triggered due to CONFIG_SLUB_DEBUG_ON. See [1]. It ends up having a few other problems too, like build errors on arch/arc, and Geert reporting it using much more memory on m68k [3] (it probably does so elsewhere too, but it is probably just more noticeable on m68k). The architecture issues (both build and memory use) are likely just because this change effectively force-enabled STACKDEPOT (along with a very bad default value for the stackdepot hash size). But together with the xfs issue, this all smells like "this commit was not ready" to me. Link: https://lore.kernel.org/linux-xfs/YPE3l82acwgI2OiV@infradead.org/ [1] Link: https://lore.kernel.org/lkml/202107150600.LkGNb4Vb-lkp@intel.com/ [2] Link: https://lore.kernel.org/lkml/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ [3] Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0d4a062a |
|
14-Jul-2021 |
Marco Elver <elver@google.com> |
mm: move helper to check slub_debug_enabled Move the helper to check slub_debug_enabled, so that we can confine the use of #ifdef outside slub.c as well. Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
78869146 |
|
07-Jul-2021 |
Oliver Glitta <glittao@gmail.com> |
mm/slub: use stackdepot to save stack trace in objects Many stack traces are similar so there are many similar arrays. Stackdepot saves each unique stack only once. Replace field addrs in struct track with depot_stack_handle_t handle. Use stackdepot to save stack trace. The benefits are smaller memory overhead and possibility to aggregate per-cache statistics in the future using the stackdepot handle instead of matching stacks manually. [rdunlap@infradead.org: rename save_stack_trace()] Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org [vbabka@suse.cz: fix lockdep splat] Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210414163434.4376-1-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
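The deduplication idea can be sketched as a tiny user-space table that stores each distinct trace once and hands back a small handle. Everything here is hypothetical (`depot_save()`, `handle_t`, the fixed table sizes, the linear search); the real stackdepot is hashed and has a different API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH  16   /* hypothetical limit on frames per trace */
#define MAX_STACKS 64   /* hypothetical limit on distinct traces */

typedef unsigned int handle_t;  /* 0 means "no handle" */

struct depot_entry {
    size_t n;
    unsigned long addrs[MAX_DEPTH];
};

static struct depot_entry depot[MAX_STACKS];
static size_t depot_used;

/*
 * Return the handle of an identical stored trace, or store the trace
 * and return a new handle. Returns 0 when the table is full.
 */
static handle_t depot_save(const unsigned long *addrs, size_t n)
{
    assert(addrs != NULL && n <= MAX_DEPTH);

    for (size_t i = 0; i < depot_used; i++) {
        if (depot[i].n == n &&
            memcmp(depot[i].addrs, addrs, n * sizeof(*addrs)) == 0)
            return (handle_t)(i + 1);   /* seen before: reuse the handle */
    }
    if (depot_used == MAX_STACKS)
        return 0;

    depot[depot_used].n = n;
    memcpy(depot[depot_used].addrs, addrs, n * sizeof(*addrs));
    return (handle_t)(++depot_used);
}
```

Each tracked object then only needs to store a `handle_t` rather than a full array of addresses, which is where the memory saving described above comes from.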
|
#
e548eaa1 |
|
16-Mar-2021 |
Maninder Singh <maninder1.s@samsung.com> |
mm/slub: Add Support for free path information of an object This commit enables a stack dump for the last free of an object: slab kmalloc-64 start c8ab0140 data offset 64 pointer offset 0 size 64 allocated at meminfo_proc_show+0x40/0x4fc [ 20.192078] meminfo_proc_show+0x40/0x4fc [ 20.192263] seq_read_iter+0x18c/0x4c4 [ 20.192430] proc_reg_read_iter+0x84/0xac [ 20.192617] generic_file_splice_read+0xe8/0x17c [ 20.192816] splice_direct_to_actor+0xb8/0x290 [ 20.193008] do_splice_direct+0xa0/0xe0 [ 20.193185] do_sendfile+0x2d0/0x438 [ 20.193345] sys_sendfile64+0x12c/0x140 [ 20.193523] ret_fast_syscall+0x0/0x58 [ 20.193695] 0xbeeacde4 [ 20.193822] Free path: [ 20.193935] meminfo_proc_show+0x5c/0x4fc [ 20.194115] seq_read_iter+0x18c/0x4c4 [ 20.194285] proc_reg_read_iter+0x84/0xac [ 20.194475] generic_file_splice_read+0xe8/0x17c [ 20.194685] splice_direct_to_actor+0xb8/0x290 [ 20.194870] do_splice_direct+0xa0/0xe0 [ 20.195014] do_sendfile+0x2d0/0x438 [ 20.195174] sys_sendfile64+0x12c/0x140 [ 20.195336] ret_fast_syscall+0x0/0x58 [ 20.195491] 0xbeeacde4 Acked-by: Vlastimil Babka <vbabka@suse.cz> Co-developed-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
0cbc124b |
|
16-Mar-2021 |
Maninder Singh <maninder1.s@samsung.com> |
mm/slub: Fix backtrace of objects to handle redzone adjustment This commit fixes commit 8e7f37f2aaa5 ("mm: Add mem_dump_obj() to print source of memory block"). With current code, the backtrace of allocated object is incorrect: / # cat /proc/meminfo [ 14.969843] slab kmalloc-64 start c8ab0140 data offset 64 pointer offset 0 size 64 allocated at 0x6b6b6b6b [ 14.970635] 0x6b6b6b6b [ 14.970794] 0x6b6b6b6b [ 14.970932] 0x6b6b6b6b [ 14.971077] 0x6b6b6b6b [ 14.971202] 0x6b6b6b6b [ 14.971317] 0x6b6b6b6b [ 14.971423] 0x6b6b6b6b [ 14.971635] 0x6b6b6b6b [ 14.971740] 0x6b6b6b6b [ 14.971871] 0x6b6b6b6b [ 14.972229] 0x6b6b6b6b [ 14.972363] 0x6b6b6b6b [ 14.972505] 0xa56b6b6b [ 14.972631] 0xbbbbbbbb [ 14.972734] 0xc8ab0400 [ 14.972891] meminfo_proc_show+0x40/0x4fc The reason is that the object address was not adjusted for the red zone. With this fix, the backtrace is correct: / # cat /proc/meminfo [ 14.870782] slab kmalloc-64 start c8ab0140 data offset 64 pointer offset 128 size 64 allocated at meminfo_proc_show+0x40/0x4f4 [ 14.871817] meminfo_proc_show+0x40/0x4f4 [ 14.872035] seq_read_iter+0x18c/0x4c4 [ 14.872229] proc_reg_read_iter+0x84/0xac [ 14.872433] generic_file_splice_read+0xe8/0x17c [ 14.872621] splice_direct_to_actor+0xb8/0x290 [ 14.872747] do_splice_direct+0xa0/0xe0 [ 14.872896] do_sendfile+0x2d0/0x438 [ 14.873044] sys_sendfile64+0x12c/0x140 [ 14.873229] ret_fast_syscall+0x0/0x58 [ 14.873372] 0xbe861de4 Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
65ebdeef |
|
28-Jun-2021 |
Georgi Djakov <quic_c_gdjako@quicinc.com> |
mm/slub: add taint after the errors are printed When running the kernel with panic_on_taint, the usual slub debug error messages are not being printed when object corruption happens. That's because we panic in add_taint(), which is called before printing the additional information. This is a bit unfortunate as the error messages are actually very useful, especially before a panic. Let's fix this by moving add_taint() after the errors are printed on the console. Link: https://lkml.kernel.org/r/1623860738-146761-1-git-send-email-quic_c_gdjako@quicinc.com Signed-off-by: Georgi Djakov <quic_c_gdjako@quicinc.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
64dd6849 |
|
28-Jun-2021 |
Faiyaz Mohammed <faiyazm@codeaurora.org> |
mm: slub: move sysfs slab alloc/free interfaces to debugfs The alloc_calls and free_calls implementation in sysfs has two issues: one is the PAGE_SIZE limitation of sysfs, and the other is that it does not adhere to the "one value per file" rule. To overcome these issues, move the alloc_calls and free_calls implementation to debugfs. Debugfs cache will be created if SLAB_STORE_USER flag is set. Rename the alloc_calls/free_calls to alloc_traces/free_traces, to be in line with what it does. [faiyazm@codeaurora.org: fix the leak of alloc/free traces debugfs interface] Link: https://lkml.kernel.org/r/1624248060-30286-1-git-send-email-faiyazm@codeaurora.org Link: https://lkml.kernel.org/r/1623438200-19361-1-git-send-email-faiyazm@codeaurora.org Signed-off-by: Faiyaz Mohammed <faiyazm@codeaurora.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
79270291 |
|
28-Jun-2021 |
Stephen Boyd <swboyd@chromium.org> |
slub: force on no_hash_pointers when slub_debug is enabled Obscuring the pointers that slub shows when debugging makes for some confusing slub debug messages: Padding overwritten. 0x0000000079f0674a-0x000000000d4dce17 Those addresses are hashed for kernel security reasons. If we're trying to be secure with slub_debug on the commandline we have some big problems given that we dump whole chunks of kernel memory to the kernel logs. Let's force on the no_hash_pointers commandline flag when slub_debug is on the commandline. This makes slub debug messages more meaningful and if by chance a kernel address is in some slub debug object dump we will have a better chance of figuring out what went wrong. Note that we don't use %px in the slub code because we want to reduce the number of places that %px is used in the kernel. This also nicely prints a big fat warning at kernel boot if slub_debug is on the commandline so that we know that this kernel shouldn't be used on production systems. [akpm@linux-foundation.org: fix build with CONFIG_SLUB_DEBUG=n] Link: https://lkml.kernel.org/r/20210601182202.3011020-5-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
582d1212 |
|
28-Jun-2021 |
Joe Perches <joe@perches.com> |
slub: indicate slab_fix() uses printf formats Ideally, slab_fix() would be marked with __printf and the format here would not use \n as that's emitted by the slab_fix(). Make these changes. Link: https://lkml.kernel.org/r/20210601182202.3011020-4-swboyd@chromium.org Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
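A user-space sketch of the same pattern: a variadic reporting helper marked with the GCC/Clang `format(printf, ...)` attribute (which the kernel's `__printf` macro wraps) that emits the trailing newline itself, so callers pass formats without "\n". `fix_msg()` and its buffer-based signature are hypothetical and exist only so the result can be inspected.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * The format string is parameter 4 and the variadic arguments start at
 * parameter 5, so the compiler can check callers' format/argument pairs.
 */
__attribute__((format(printf, 4, 5)))
static void fix_msg(char *out, size_t outsz, const char *cache,
                    const char *fmt, ...)
{
    char body[128];
    va_list args;

    assert(out != NULL && outsz > 0);

    va_start(args, fmt);
    vsnprintf(body, sizeof(body), fmt, args);
    va_end(args);

    /* The helper appends the newline; callers must not add their own. */
    snprintf(out, outsz, "FIX %s: %s\n", cache, body);
}
```

With the attribute in place, a call such as `fix_msg(buf, sizeof(buf), "kmalloc-64", "Restoring %s", 42)` draws a format warning from `-Wformat` at compile time instead of misbehaving at run time.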
|
#
1a88ef87 |
|
28-Jun-2021 |
Stephen Boyd <swboyd@chromium.org> |
slub: actually use 'message' in restore_bytes() The message argument isn't used here. Let's pass the string to the printk message so that the developer can figure out what's happening, instead of guessing that a redzone is being restored, etc. Link: https://lkml.kernel.org/r/20210601182202.3011020-3-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02ac47d0 |
|
28-Jun-2021 |
Stephen Boyd <swboyd@chromium.org> |
slub: restore slub_debug=- behavior Patch series "slub: Print non-hashed pointers in slub debugging", v3. I was doing some debugging recently and noticed that my pointers were being hashed while slub_debug was on the kernel commandline. Let's force on the no hash pointer option when slub_debug is on the kernel commandline so that the prints are more meaningful. The first two patches are something else I noticed while looking at the code. The message argument is never used so the debugging messages are not as clear as they could be and the slub_debug=- behavior seems to be busted. Then there's a printf fixup from Joe and the final patch is the one that force disables pointer hashing. This patch (of 4): Passing slub_debug=- on the kernel commandline is supposed to disable slub debugging. This is especially useful with CONFIG_SLUB_DEBUG_ON where the default is to have slub debugging enabled in the build. Due to some code reorganization this behavior was dropped, but the code to make it work mostly stuck around. Restore the previous behavior by disabling the static key when we parse the commandline and see that we're trying to disable slub debugging. Link: https://lkml.kernel.org/r/20210601182202.3011020-1-swboyd@chromium.org Link: https://lkml.kernel.org/r/20210601182202.3011020-2-swboyd@chromium.org Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3d8e374c |
|
28-Jun-2021 |
Oliver Glitta <glittao@gmail.com> |
slub: remove resiliency_test() function Function resiliency_test() is hidden behind #ifdef SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it. This function is replaced with KUnit test for SLUB added by the previous patch "selftests: add a KUnit test for SLUB debugging functionality". Link: https://lkml.kernel.org/r/20210511150734.3492-3-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oliver Glitta <glittao@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f9f78b1 |
|
28-Jun-2021 |
Oliver Glitta <glittao@gmail.com> |
mm/slub, kunit: add a KUnit test for SLUB debugging functionality SLUB has resiliency_test() function which is hidden behind #ifdef SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it. KUnit should be a proper replacement for it. Try changing byte in redzone after allocation and changing pointer to next free node, first byte, 50th byte and redzone byte. Check if validation finds errors. There are several differences from the original resiliency test: Tests create own caches with known state instead of corrupting shared kmalloc caches. The corruption of freepointer uses correct offset, the original resiliency test got broken with freepointer changes. Scratch changing random byte test, because it does not have meaning in this form where we need deterministic results. Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig. Tests next_pointer, first_word and clobber_50th_byte do not run with KASAN option on. Because the test deliberately modifies non-allocated objects. Use kunit_resource to count errors in cache and silence bug reports. Count error whenever slab_bug() or slab_fix() is called or when the count of pages is wrong. [glittao@gmail.com: remove unused function test_exit(), from SLUB KUnit test] Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com [akpm@linux-foundation.org: export kasan_enable/disable_current to modules] Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Daniel Latypov <dlatypov@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b3865d0 |
|
15-Jun-2021 |
Andrew Morton <akpm@linux-foundation.org> |
mm/slub.c: include swab.h Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y. Hopefully. But it's the right thing to do anyway. Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation") Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417 Reported-by: <vannguye@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e41a49fa |
|
15-Jun-2021 |
Kees Cook <keescook@chromium.org> |
mm/slub: actually fix freelist pointer vs redzoning It turns out that SLUB redzoning ("slub_debug=Z") checks from s->object_size rather than from s->inuse (which is normally bumped to make room for the freelist pointer), so a cache created with an object size less than 24 would have the freelist pointer written beyond s->object_size, causing the redzone to be corrupted by the freelist pointer. This was very visible with "slub_debug=ZF": BUG test (Tainted: G B ): Right Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200 INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620 Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): 00 00 00 00 00 f6 f4 a5 ........ Redzone (____ptrval____): 40 1d e8 1a aa @.... Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ Adjust the offset to stay within s->object_size. (Note that no caches in this size range are known to exist in the kernel currently.)
Link: https://lkml.kernel.org/r/20210608183955.280836-4-keescook@chromium.org Link: https://lore.kernel.org/linux-mm/20200807160627.GA1420741@elver.google.com/ Link: https://lore.kernel.org/lkml/0f7dd7b2-7496-5e2d-9488-2ec9f8e90441@suse.cz/ Fixes: 89b83f282d8b (slub: avoid redzone when choosing freepointer location) Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcpqStkA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Marco Elver <elver@google.com> Reported-by: "Lin, Zhenpeng" <zplin@psu.edu> Tested-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
74c1d3e0 |
|
15-Jun-2021 |
Kees Cook <keescook@chromium.org> |
mm/slub: fix redzoning for small allocations The redzone area for SLUB exists between s->object_size and s->inuse (which is at least the word-aligned object_size). If a cache were created with an object_size smaller than sizeof(void *), the in-object stored freelist pointer would overwrite the redzone (e.g. with boot param "slub_debug=ZF"): BUG test (Tainted: G B ): Right Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200 INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620 Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ Store the freelist pointer out of line when object_size is smaller than sizeof(void *) and redzoning is enabled. Additionally remove the "smaller than sizeof(void *)" check under CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant: SLAB and SLOB both handle small sizes. (Note that no caches within this size range are known to exist in the kernel currently.) Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org Fixes: 81819f0fc828 ("SLUB core") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Lin, Zhenpeng" <zplin@psu.edu> Cc: Marco Elver <elver@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
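The placement rule described above can be sketched as a small layout calculation. This is a simplified user-space model under stated assumptions: `struct layout` and `compute_layout()` are hypothetical, the in-object case is pinned to offset 0, and alignment, poisoning and tracking data are ignored. It is not the kernel's size calculation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-object layout (not struct kmem_cache). */
struct layout {
    size_t object_size; /* bytes requested by the cache creator */
    size_t inuse;       /* object_size rounded up to a word boundary */
    size_t fp_offset;   /* where the free pointer is stored */
    size_t size;        /* per-object footprint in this sketch */
};

static size_t align_up(size_t n, size_t a)
{
    return (n + a - 1) / a * a;
}

static struct layout compute_layout(size_t object_size, bool red_zone)
{
    struct layout l;

    assert(object_size > 0);

    l.object_size = object_size;
    l.inuse = align_up(object_size, sizeof(void *));

    if (red_zone && object_size < sizeof(void *)) {
        /*
         * A pointer stored in the object would spill into the redzone
         * that starts at object_size, so store it after the object.
         */
        l.fp_offset = l.inuse;
        l.size = l.inuse + sizeof(void *);
    } else {
        /* Simplification: the pointer overlays the start of the object. */
        l.fp_offset = 0;
        l.size = l.inuse;
    }
    return l;
}
```

For a 2-byte object with redzoning, this model places the pointer one word past the start, leaving the bytes between `object_size` and `inuse` free to hold the redzone pattern.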
|
#
8669dbab |
|
15-Jun-2021 |
Kees Cook <keescook@chromium.org> |
mm/slub: clarify verification reporting Patch series "Actually fix freelist pointer vs redzoning", v4. This fixes redzoning vs the freelist pointer (both for middle-position and very small caches). Both are "theoretical" fixes, in that I see no evidence of such small-sized caches actually being used in the kernel, but that's no reason to let the bugs continue to exist, especially since people doing local development keep tripping over it. :) This patch (of 3): Instead of repeating "Redzone" and "Poison", clarify which sides of those zones got tripped. Additionally fix column alignment in the trailer. Before: BUG test (Tainted: G B ): Redzone overwritten ... Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ After: BUG test (Tainted: G B ): Right Redzone overwritten ... Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........
The earlier commits that slowly resulted in the "Before" reporting were: d86bd1bece6f ("mm/slub: support left redzone") ffc79d288000 ("slub: use print_hex_dump") 2492268472e7 ("SLUB: change error reporting format to follow lockdep loosely") Link: https://lkml.kernel.org/r/20210608183955.280836-1-keescook@chromium.org Link: https://lkml.kernel.org/r/20210608183955.280836-2-keescook@chromium.org Link: https://lore.kernel.org/lkml/cfdb11d7-fb8e-e578-c939-f7f5fb69a6bd@suse.cz/ Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Marco Elver <elver@google.com> Cc: "Lin, Zhenpeng" <zplin@psu.edu> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f70b0049 |
|
22-May-2021 |
Alexander Potapenko <glider@google.com> |
kasan: slab: always reset the tag in get_freepointer_safe() With CONFIG_DEBUG_PAGEALLOC enabled, the kernel should also untag the object pointer, as done in get_freepointer(). Failing to do so reportedly leads to SLUB freelist corruptions that manifest as boot-time crashes. Link: https://lkml.kernel.org/r/20210514072228.534418-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Elliot Berman <eberman@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
afe0c26d |
|
14-May-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: move slub_debug static key enabling outside slab_mutex Paul E. McKenney reported [1] that commit 1f0723a4c0df ("mm, slub: enable slub_debug static key when creating cache with explicit debug flags") results in the lockdep complaint: ====================================================== WARNING: possible circular locking dependency detected 5.12.0+ #15 Not tainted ------------------------------------------------------ rcu_torture_sta/109 is trying to acquire lock: ffffffff96063cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20 but task is already holding lock: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.}-{3:3}: lock_acquire+0xb9/0x3a0 __mutex_lock+0x8d/0x920 slub_cpu_dead+0x15/0xf0 cpuhp_invoke_callback+0x17a/0x7c0 cpuhp_invoke_callback_range+0x3b/0x80 _cpu_down+0xdf/0x2a0 cpu_down+0x2c/0x50 device_offline+0x82/0xb0 remove_cpu+0x1a/0x30 torture_offline+0x80/0x140 torture_onoff+0x147/0x260 kthread+0x10a/0x140 ret_from_fork+0x22/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: check_prev_add+0x8f/0xbf0 __lock_acquire+0x13f0/0x1d80 lock_acquire+0xb9/0x3a0 cpus_read_lock+0x21/0xa0 static_key_enable+0x9/0x20 __kmem_cache_create+0x38d/0x430 kmem_cache_create_usercopy+0x146/0x250 kmem_cache_create+0xd/0x10 rcu_torture_stats+0x79/0x280 kthread+0x10a/0x140 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(cpu_hotplug_lock); lock(slab_mutex); lock(cpu_hotplug_lock); *** DEADLOCK *** 1 lock held by rcu_torture_sta/109: #0: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250 stack backtrace: CPU: 3 PID: 109 Comm: rcu_torture_sta Not tainted 5.12.0+ #15 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack+0x6d/0x89 
check_noncircular+0xfe/0x110 ? lock_is_held_type+0x98/0x110 check_prev_add+0x8f/0xbf0 __lock_acquire+0x13f0/0x1d80 lock_acquire+0xb9/0x3a0 ? static_key_enable+0x9/0x20 ? mark_held_locks+0x49/0x70 cpus_read_lock+0x21/0xa0 ? static_key_enable+0x9/0x20 static_key_enable+0x9/0x20 __kmem_cache_create+0x38d/0x430 kmem_cache_create_usercopy+0x146/0x250 ? rcu_torture_stats_print+0xd0/0xd0 kmem_cache_create+0xd/0x10 rcu_torture_stats+0x79/0x280 ? rcu_torture_stats_print+0xd0/0xd0 kthread+0x10a/0x140 ? kthread_park+0x80/0x80 ret_from_fork+0x22/0x30 This is because there's one order of locking from the hotplug callbacks: lock(cpu_hotplug_lock); // from hotplug machinery itself lock(slab_mutex); // in e.g. slab_mem_going_offline_callback() And commit 1f0723a4c0df made the reverse sequence possible: lock(slab_mutex); // in kmem_cache_create_usercopy() lock(cpu_hotplug_lock); // kmem_cache_open() -> static_key_enable() The simplest fix is to move static_key_enable() to a place before slab_mutex is taken. That means kmem_cache_create_usercopy() in mm/slab_common.c which is not ideal for SLUB-specific code, but the #ifdef CONFIG_SLUB_DEBUG makes it at least self-contained and obvious. [1] https://lore.kernel.org/lkml/20210502171827.GA3670492@paulmck-ThinkPad-P17-Gen-1/ Link: https://lkml.kernel.org/r/20210504120019.26791-1-vbabka@suse.cz Fixes: 1f0723a4c0df ("mm, slub: enable slub_debug static key when creating cache with explicit debug flags") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
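Editor's illustration (not part of the commit): the inverted lock order reported above can be sketched with a toy order tracker. This is only a loose analogy to what lockdep does, and every name below (`struct order_graph`, `record_nested_acquire()`, the `enum lock_id` values) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The two locks involved in the report, as plain identifiers. */
enum lock_id { CPU_HOTPLUG_LOCK, SLAB_MUTEX, NR_LOCKS };

/* before[a][b] is true once "a held while taking b" was observed. */
struct order_graph {
	bool before[NR_LOCKS][NR_LOCKS];
};

static void order_graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record that 'next' is acquired while 'held' is held. Returns true
 * if the opposite nesting was seen earlier, i.e. an ABBA inversion
 * that could deadlock two CPUs.
 */
static bool record_nested_acquire(struct order_graph *g,
				  enum lock_id held, enum lock_id next)
{
	assert(held != next);
	g->before[held][next] = true;
	return g->before[next][held];
}
```

Recording the hotplug path (cpu_hotplug_lock, then slab_mutex) followed by the cache creation path introduced by 1f0723a4c0df (slab_mutex, then cpu_hotplug_lock) makes the second call return true. With static_key_enable() moved ahead of slab_mutex, the second nesting is never recorded.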
|
#
f0953a1b |
|
06-May-2021 |
Ingo Molnar <mingo@kernel.org> |
mm: fix typos in comments Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d57a964e |
|
30-Apr-2021 |
Andrey Konovalov <andreyknvl@gmail.com> |
kasan, mm: integrate slab init_on_free with HW_TAGS This change uses the previously added memory initialization feature of HW_TAGS KASAN routines for slab memory when init_on_free is enabled. With this change, memory initialization memset() is no longer called when both HW_TAGS KASAN and init_on_free are enabled. Instead, memory is initialized in KASAN runtime. For SLUB, the memory initialization memset() is moved into slab_free_hook() that currently directly follows the initialization loop. A new argument is added to slab_free_hook() that indicates whether to initialize the memory or not. To avoid discrepancies with which memory gets initialized that can be caused by future changes, both KASAN hook and initialization memset() are put together and a warning comment is added. Combining setting allocation tags with memory initialization improves HW_TAGS KASAN performance when init_on_free is enabled. Link: https://lkml.kernel.org/r/190fd15c1886654afdec0d19ebebd5ade665b601.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
da844b78 |
|
30-Apr-2021 |
Andrey Konovalov <andreyknvl@gmail.com> |
kasan, mm: integrate slab init_on_alloc with HW_TAGS This change uses the previously added memory initialization feature of HW_TAGS KASAN routines for slab memory when init_on_alloc is enabled. With this change, memory initialization memset() is no longer called when both HW_TAGS KASAN and init_on_alloc are enabled. Instead, memory is initialized in KASAN runtime. The memory initialization memset() is moved into slab_post_alloc_hook() that currently directly follows the initialization loop. A new argument is added to slab_post_alloc_hook() that indicates whether to initialize the memory or not. To avoid discrepancies with which memory gets initialized that can be caused by future changes, both KASAN hook and initialization memset() are put together and a warning comment is added. Combining setting allocation tags with memory initialization improves HW_TAGS KASAN performance when init_on_alloc is enabled. Link: https://lkml.kernel.org/r/c1292aeb5d519da221ec74a0684a949b027d7720.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dc84207d |
|
29-Apr-2021 |
Bhaskar Chowdhury <unixbhaskar@gmail.com> |
mm/slub.c: trivial typo fixes s/operatios/operations/ s/Mininum/Minimum/ s/mininum/minimum/ ......two different places. Link: https://lkml.kernel.org/r/20210325044940.14516-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f0723a4 |
|
29-Apr-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: enable slub_debug static key when creating cache with explicit debug flags Commit ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") introduced a static key to optimize the case where no debugging is enabled for any cache. The static key is enabled when slub_debug boot parameter is passed, or CONFIG_SLUB_DEBUG_ON enabled. However, some caches might be created with one or more debugging flags explicitly passed to kmem_cache_create(), and the commit missed this. Thus the debugging functionality would not be actually performed for these caches unless the static key gets enabled by boot param or config. This patch fixes it by checking for debugging flags passed to kmem_cache_create() and enabling the static key accordingly. Note such explicit debugging flags should not be used outside of debugging and testing as they will now enable the static key globally. btrfs_init_cachep() creates a cache with SLAB_RED_ZONE but that's a mistake that's being corrected [1]. rcu_torture_stats() creates a cache with SLAB_STORE_USER, but that is a testing module so it's OK and will start working as intended after this patch. Also note that in case of backports to kernels before v5.12 that don't have 59450bbc12be ("mm, slab, slub: stop taking cpu hotplug lock"), static_branch_enable_cpuslocked() should be used. [1] https://lore.kernel.org/linux-btrfs/20210315141824.26099-1-dsterba@suse.com/ Link: https://lkml.kernel.org/r/20210315153415.24404-1-vbabka@suse.cz Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Oliver Glitta <glittao@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5bb1bb35 |
|
07-Jan-2021 |
Paul E. McKenney <paulmck@kernel.org> |
mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels The mem_dump_obj() functionality adds a few hundred bytes, which is a small price to pay. Except on kernels built with CONFIG_PRINTK=n, in which mem_dump_obj() messages will be suppressed. This commit therefore makes mem_dump_obj() be a static inline empty function on kernels built with CONFIG_PRINTK=n and excludes all of its support functions as well. This avoids kernel bloat on systems that cannot use mem_dump_obj(). Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <linux-mm@kvack.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
96b94abc |
|
19-Mar-2021 |
Yafang Shao <laoar.shao@gmail.com> |
mm, slub: don't combine pr_err with INFO It is strange to combine "pr_err" with "INFO", so let's remove the prefix completely. This patch is motivated by David's comment[1]. - before the patch [ 8846.517809] INFO: Slab 0x00000000f42a2c60 objects=33 used=3 fp=0x0000000060d32ca8 flags=0x17ffffc0010200(slab|head) - after the patch [ 6343.396602] Slab 0x000000004382e02b objects=33 used=3 fp=0x000000009ae06ffc flags=0x17ffffc0010200(slab|head) [1] https://lore.kernel.org/linux-mm/b9c0f2b6-e9b0-0c36-ebdd-2bc684c5a762@redhat.com/#t Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-3-laoar.shao@gmail.com
|
#
4a8ef190 |
|
19-Mar-2021 |
Yafang Shao <laoar.shao@gmail.com> |
mm, slub: use pGp to print page flags As pGp has been already introduced in printk, we'd better use it to make the output human readable. Before this change, the output is, [ 6155.716018] INFO: Slab 0x000000004027dd4f objects=33 used=3 fp=0x000000008cd1579c flags=0x17ffffc0010200 While after this change, the output is, [ 8846.517809] INFO: Slab 0x00000000f42a2c60 objects=33 used=3 fp=0x0000000060d32ca8 flags=0x17ffffc0010200(slab|head) Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-2-laoar.shao@gmail.com
|
#
9b1ea29b |
|
10-Mar-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "mm, slub: consider rest of partial list if acquire_slab() fails" This reverts commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf. The kernel test robot reports a huge performance regression due to the commit, and the reason seems fairly straightforward: when there is contention on the page list (which is what causes acquire_slab() to fail), we do _not_ want to just loop and try again, because that will transfer the contention to the 'n->list_lock' spinlock we hold, and just make things even worse. This is admittedly likely a problem only on big machines - the kernel test robot report comes from a 96-thread dual socket Intel Xeon Gold 6252 setup, but the regression there really is quite noticeable: -47.9% regression of stress-ng.rawpkt.ops_per_sec and the commit that was marked as being fixed (7ced37197196: "slub: Acquire_slab() avoid loop") actually did the loop exit early very intentionally (the hint being that "avoid loop" part of that commit message), exactly to avoid this issue. The correct thing to do may be to pick some kind of reasonable middle ground: instead of breaking out of the loop on the very first sign of contention, or trying over and over and over again, the right thing may be to re-try _once_, and then give up on the second failure (or pick your favorite value for "once"..). Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <jannh@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e2db1a9a |
|
25-Feb-2021 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, mm: optimize kmalloc poisoning For allocations from kmalloc caches, kasan_kmalloc() always follows kasan_slab_alloc(). Currently, both of them unpoison the whole object, which is unnecessary. This patch provides separate implementations for both annotations: kasan_slab_alloc() unpoisons the whole object, and kasan_kmalloc() only poisons the redzone. For generic KASAN, the redzone start might not be aligned to KASAN_GRANULE_SIZE. Therefore, the poisoning is split in two parts: kasan_poison_last_granule() poisons the unaligned part, and then kasan_poison() poisons the rest. This patch also clarifies alignment guarantees of each of the poisoning functions and drops the unnecessary round_up() call for redzone_end. With this change, the early SLUB cache annotation needs to be changed to kasan_slab_alloc(), as kasan_kmalloc() doesn't unpoison objects now. The number of poisoned bytes for objects in this cache stays the same, as kmem_cache_node->object_size is equal to sizeof(struct kmem_cache_node). Link: https://lkml.kernel.org/r/7e3961cb52be380bc412860332063f5f7ce10d13.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
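Editor's illustration (not part of the commit): a userland sketch of the two-part redzone poisoning described above, assuming an 8-byte granule and a shadow array with one signed byte per granule, where a value of 1 to 7 means "only the first N bytes are accessible". The helper names and the `REDZONE_MARK` value are hypothetical stand-ins, not the kernel's kasan_poison_last_granule() or kasan_poison().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE		8u
/* Hypothetical marker; any negative value means "inaccessible" here. */
#define REDZONE_MARK	((int8_t)-1)

/* Part 1: mark the partially used granule, if the size is unaligned. */
static void poison_last_granule(int8_t *shadow, size_t size)
{
	if (size % GRANULE)
		shadow[size / GRANULE] = (int8_t)(size % GRANULE);
}

/* Part 2: poison whole granules; both bounds must be aligned. */
static void poison_range(int8_t *shadow, size_t start, size_t end)
{
	assert(start % GRANULE == 0 && end % GRANULE == 0);
	for (size_t g = start / GRANULE; g < end / GRANULE; g++)
		shadow[g] = REDZONE_MARK;
}

/* Poison everything between the requested size and the slot end. */
static void poison_kmalloc_redzone(int8_t *shadow, size_t size,
				   size_t slot_size)
{
	size_t aligned = (size + GRANULE - 1) / GRANULE * GRANULE;

	poison_last_granule(shadow, size);
	poison_range(shadow, aligned, slot_size);
}
```

For a 13-byte request in a 32-byte slot, the first granule stays fully accessible, the second records 5 accessible bytes, and the last two granules are marked as redzone.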
|
#
b89fb5ef |
|
25-Feb-2021 |
Alexander Potapenko <glider@google.com> |
mm, kfence: insert KFENCE hooks for SLUB Inserts KFENCE hooks into the SLUB allocator. To pass the originally requested size to KFENCE, add an argument 'orig_size' to slab_alloc*(). The additional argument is required to preserve the requested original size for kmalloc() allocations, which uses size classes (e.g. an allocation of 272 bytes will return an object of size 512). Therefore, kmem_cache::size does not represent the kmalloc-caller's requested size, and we must introduce the argument 'orig_size' to propagate the originally requested size to KFENCE. Without the originally requested size, we would not be able to detect out-of-bounds accesses for objects placed at the end of a KFENCE object page if that object is not equal to the kmalloc-size class it was bucketed into. When KFENCE is disabled, there is no additional overhead, since slab_alloc*() functions are __always_inline. Link: https://lkml.kernel.org/r/20201103175841.3495947-6-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jann Horn <jannh@google.com> Co-developed-by: Marco Elver <elver@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joern Engel <joern@purestorage.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul E. 
McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: SeongJae Park <sjpark@amazon.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
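Editor's illustration (not part of the commit): a sketch of why the bucketed size differs from the requested size. The table below is an assumed set of kmalloc size classes for a common configuration; the actual set depends on architecture and kernel config. `size_class_for()` and `bucket_slack()` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size classes; the real list is configuration dependent. */
static const size_t size_classes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};

/* Smallest class that fits 'size', or 0 if none does in this model. */
static size_t size_class_for(size_t size)
{
	const size_t n = sizeof(size_classes) / sizeof(size_classes[0]);

	for (size_t i = 0; i < n; i++)
		if (size <= size_classes[i])
			return size_classes[i];
	return 0;
}

/* Bytes between the requested size and the end of its bucket. */
static size_t bucket_slack(size_t orig_size)
{
	size_t cls = size_class_for(orig_size);

	assert(cls != 0);
	return cls - orig_size;
}
```

A 272-byte request lands in the 512-byte class, leaving 240 bytes of slack. Without the original size, an out-of-bounds access into that slack could not be told apart from a valid access to the object.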
|
#
027b37b5 |
|
24-Feb-2021 |
Andrey Konovalov <andreyknvl@google.com> |
kasan: move _RET_IP_ to inline wrappers Generic mm functions that call KASAN annotations that might report a bug pass _RET_IP_ to them as an argument. This allows KASAN to include the name of the function that called the mm function in its report's header. Now that KASAN has inline wrappers for all of its annotations, move _RET_IP_ to those wrappers to simplify annotation call sites. Link: https://linux-review.googlesource.com/id/I8fb3c06d49671305ee184175a39591bc26647a67 Link: https://lkml.kernel.org/r/5c1490eddf20b436b8c4eeea83fce47687d5e4a4.1610733117.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
96403bfe |
|
24-Feb-2021 |
Muchun Song <songmuchun@bytedance.com> |
mm: memcontrol: fix slub memory accounting SLUB currently accounts kmalloc() and kmalloc_node() allocations larger than order-1 page per-node. But it forgets to update the per-memcg vmstats, which can lead to inaccurate "slab_unreclaimable" statistics in memory.stat. Fix it by using mod_lruvec_page_state instead of mod_node_page_state. Link: https://lkml.kernel.org/r/20210223092423.42420-1-songmuchun@bytedance.com Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2e9bd483 |
|
24-Feb-2021 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNT In general it's unknown in advance if a slab page will contain accounted objects or not. In order to avoid memory waste, an obj_cgroup vector is allocated dynamically when the need to account a new object arises. Such an approach is memory efficient, but requires an expensive cmpxchg() to set up the memcg/objcgs pointer, because an allocation can race with a different allocation on another cpu. But in some common cases it's known for sure that a slab page will contain accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT flag set. It includes such popular objects like vm_area_struct, anon_vma, task_struct, etc. In such cases we can pre-allocate the objcgs vector and simply assign it to the page without any atomic operations, because at this early stage the page is not visible to anyone else. A very simplistic benchmark (allocating 10000000 64-byte objects in a row) shows ~15% win. In real life it seems that most workloads are not very sensitive to the speed of (accounted) slab allocations. [guro@fb.com: open-code set_page_objcgs() and add some comments, by Johannes] Link: https://lkml.kernel.org/r/20201113001926.GA2934489@carbon.dhcp.thefacebook.com [akpm@linux-foundation.org: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch] Link: https://lkml.kernel.org/r/20201110195753.530157-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
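Editor's illustration (not part of the commit): the difference between lazily installing a per-page vector with a compare-and-exchange and pre-allocating it before the page is published can be sketched in userland C11. `struct slab_page`, `objcgs_get_lazy()` and `objcgs_prealloc()` are hypothetical names, and `calloc()` stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab page carrying an objcgs vector. */
struct slab_page {
	_Atomic(void **) objcgs;
};

/*
 * Lazy path: the page may already be visible to other CPUs, so the
 * vector is installed with a compare-and-exchange. The loser of a
 * race frees its own vector and uses the winner's.
 */
static void **objcgs_get_lazy(struct slab_page *page, size_t nr_objects)
{
	void **cur = atomic_load(&page->objcgs);
	void **expected = NULL;
	void **vec;

	assert(nr_objects > 0);
	if (cur)
		return cur;
	vec = calloc(nr_objects, sizeof(*vec));
	if (!vec)
		return NULL;
	if (atomic_compare_exchange_strong(&page->objcgs, &expected, vec))
		return vec;
	free(vec);
	return expected;	/* holds the vector installed by the winner */
}

/*
 * Pre-allocation path: the caller guarantees the page has not been
 * published yet, so a plain (relaxed) store is sufficient.
 */
static bool objcgs_prealloc(struct slab_page *page, size_t nr_objects)
{
	void **vec;

	assert(nr_objects > 0);
	vec = calloc(nr_objects, sizeof(*vec));
	if (!vec)
		return false;
	atomic_store_explicit(&page->objcgs, vec, memory_order_relaxed);
	return true;
}
```

The pre-allocation path trades a little memory for skipping the atomic read-modify-write, which is the trade-off the commit message describes for SLAB_ACCOUNT caches.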
|
#
457c82c3 |
|
24-Feb-2021 |
Zhiyuan Dai <daizhiyuan@phytium.com.cn> |
mm/slub: minor coding style tweaks Add whitespace to fix coding style issues, improve code reading. Link: https://lkml.kernel.org/r/1612847403-5594-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fe2cce15 |
|
24-Feb-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: remove slub_memcg_sysfs boot param and CONFIG_SLUB_MEMCG_SYSFS_ON The boot param and config determine the value of memcg_sysfs_enabled, which is unused since commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") as there are no per-memcg kmem caches anymore. Link: https://lkml.kernel.org/r/20210127124745.7928-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d930ff03 |
|
24-Feb-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: splice cpu and page freelists in deactivate_slab() In deactivate_slab() we currently move all but one objects on the cpu freelist to the page freelist one by one using the costly cmpxchg_double() operation. Then we unfreeze the page while moving the last object on page freelist, with a final cmpxchg_double(). This can be optimized to avoid the cmpxchg_double() per object. Just count the objects on cpu freelist (to adjust page->inuse properly) and also remember the last object in the chain. Then splice page->freelist to the last object and effectively add the whole cpu freelist to page->freelist while unfreezing the page, with a single cmpxchg_double(). Link: https://lkml.kernel.org/r/20210115183543.15097-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
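Editor's illustration (not part of the commit): the count-then-splice idea can be sketched on a plain singly linked list. `struct object`, `struct page_model` and `splice_cpu_freelist()` are hypothetical; in the kernel the final update of the freelist and counters is a single cmpxchg_double(), which this single-threaded model replaces with ordinary stores.

```c
#include <assert.h>
#include <stddef.h>

/* A free object links to the next free object. */
struct object {
	struct object *next;
};

/* Hypothetical stand-in for the page's freelist and in-use counter. */
struct page_model {
	struct object *freelist;
	unsigned int inuse;
};

/*
 * Walk the cpu freelist once to count objects and find its tail,
 * then attach the page freelist behind the tail and publish the
 * combined list in one step. Returns the number of objects moved.
 */
static unsigned int splice_cpu_freelist(struct page_model *page,
					struct object *cpu_freelist)
{
	struct object *tail = NULL;
	unsigned int count = 0;

	for (struct object *o = cpu_freelist; o; o = o->next) {
		tail = o;
		count++;
	}
	if (!tail)
		return 0;

	assert(page->inuse >= count);
	tail->next = page->freelist;	/* splice page list after tail */
	page->freelist = cpu_freelist;	/* one update instead of 'count' */
	page->inuse -= count;
	return count;
}
```

The walk costs one pass over the cpu freelist, but the expensive atomic update happens once rather than once per object.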
|
#
7e1fa93d |
|
24-Feb-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab, slub: stop taking memory hotplug lock Since commit 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for SLAB and SLUB when creating, destroying or shrinking a cache. It is quite a heavy lock and it's best to avoid it if possible, as we had several issues with lockdep complaining about ordering in the past, see e.g. e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()"). The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock) can be summarized as follows: while there's slab_mutex synchronizing new kmem cache creation and SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(), we may miss creation of kmem_cache_node for the hotplugged node in the new kmem cache, because the hotplug callback doesn't yet see the new cache, and cache creation in init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the N_NORMAL_MEMORY nodemask, which however may not yet include the new node, as that happens only later after the MEM_GOING_ONLINE callback. Instead of using get/put_online_mems(), the problem can be solved by SLUB maintaining its own nodemask of nodes for which it has allocated the per-node kmem_cache_node structures. This nodemask would generally mirror the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's control in its memory hotplug callbacks under the slab_mutex. This patch adds such nodemask and its handling. Commit 03afc0e25f7f mentions "issues like [the one above]", but there don't appear to be further issues. All the paths (shared for SLAB and SLUB) taking the memory hotplug locks are also taking the slab_mutex, except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with get/put_online_mems(). 
We however cannot simply restore slab_mutex in kmem_cache_shrink(), as SLUB can enter the function from a write to sysfs 'shrink' file, thus holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested within slab_mutex. But on closer inspection we don't actually need to protect kmem_cache_shrink() from hotplug callbacks: While SLUB's __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node added in parallel hotplug is not fatal, and parallel hotremove does not free kmem_cache_node's anymore after the previous patch, so use-after-free cannot happen. The per-node shrinking itself is protected by n->list_lock. Same is true for SLAB, and SLOB is no-op. SLAB also doesn't need the memory hotplug locking, which it only gained by 03afc0e25f7f through the shared paths in slab_common.c. Its memory hotplug callbacks are also protected by slab_mutex against races with these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new node is already set there during the MEM_GOING_ONLINE callback, so no special care is needed for SLAB. As such, this patch removes all get/put_online_mems() usage by the slab subsystem. Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
666716fd |
|
24-Feb-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: stop freeing kmem_cache_node structures on node offline Patch series "mm, slab, slub: remove cpu and memory hotplug locks". Some related work caused me to look at how we use get/put_online_mems() and get/put_online_cpus() during kmem cache creation/destruction/shrinking, and realize that it should be actually safe to remove all of that with rather small effort (as e.g. Michal Hocko suspected in some of the past discussions already). This has the benefit of avoiding rather heavy locks that have caused locking order issues already in the past. So this is the result: patches 2 and 3 remove memory hotplug and cpu hotplug locking, respectively. Patch 1 is due to realization that in fact some races exist despite the locks (even if not removed), but the most sane solution is not to introduce more of them, but rather accept some wasted memory in scenarios that should be rare anyway (full memory hot remove), as we do the same in other contexts already. This patch (of 3): Commit e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()") has fixed a problematic locking order by removing the memory hotplug lock get/put_online_mems() from show_slab_objects(). During the discussion, it was argued [1] that this is OK, because existing slabs on the node would prevent a hotremove to proceed. That's true, but per-node kmem_cache_node structures are not necessarily allocated on the same node and may exist even without actual slab pages on the same node. Any path that uses get_node() directly or via for_each_kmem_cache_node() (such as show_slab_objects()) can race with freeing of kmem_cache_node even with the !NULL check, resulting in use-after-free. To that end, commit e4f8e513c3d3 argues in a comment that: * We don't really need mem_hotplug_lock (to hold off * slab_mem_going_offline_callback) here because slab's memory hot * unplug code doesn't destroy the kmem_cache->node[] data.
While it's true that slab_mem_going_offline_callback() doesn't free the kmem_cache_node, the later callback slab_mem_offline_callback() actually does, so the race and use-after-free exist. Not just for show_slab_objects() after commit e4f8e513c3d3, but also many other places that are not under slab_mutex. And adding slab_mutex locking or other synchronization to SLUB paths such as get_any_partial() would be bad for performance and error-prone. The easiest solution is therefore to make the abovementioned comment true and stop freeing the kmem_cache_node structures, accepting some wasted memory in the full memory node removal scenario. Analogously we also don't free hotremoved pgdat as mentioned in [1], nor the similar per-node structures in SLAB. Importantly this approach will not block the hotremove, as generally such nodes should be movable in order to succeed hotremove in the first place, and thus the GFP_KERNEL allocated kmem_cache_node will come from elsewhere. [1] https://lore.kernel.org/linux-mm/20190924151147.GB23050@dhcp22.suse.cz/ Link: https://lkml.kernel.org/r/20210113131634.3671-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210113131634.3671-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Qian Cai <cai@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ca220593 |
|
24-Feb-2021 |
Johannes Berg <johannes.berg@intel.com> |
mm/slub: disable user tracing for kmemleak caches by default If kmemleak is enabled, it uses a kmem cache for its own objects. These objects are used to hold information kmemleak uses, including a stack trace. If slub_debug is also turned on, each of them has *another* stack trace, so the overhead adds up, and in my tests (on ARCH=um, admittedly) two thirds of the allocations end up doing the stack tracing. Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid storing essentially the same data twice. Link: https://lkml.kernel.org/r/20210113215114.d94efa13ba30.I117b6764e725b3192318bbcf4269b13b709539ae@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
37540008 |
|
24-Feb-2021 |
Nikolay Borisov <nborisov@suse.com> |
mm/sl?b.c: remove ctor argument from kmem_cache_flags This argument hasn't been used since e153362a50a3 ("slub: Remove objsize check in kmem_cache_flags()") so simply remove it. Link: https://lkml.kernel.org/r/20210126095733.974665-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3544de8e |
|
24-Feb-2021 |
Jacob Wen <jian.w.wen@oracle.com> |
mm, tracing: record slab name for kmem_cache_free()

Currently, a trace record generated by the RCU core is as below.

    ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f3b49a66

It doesn't tell us what the RCU core has freed. This patch adds the slab name to trace_kmem_cache_free(). The new format is as follows.

    ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000037f79c8d name=dentry
    ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f78cb7b5 name=sock_inode_cache
    ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000018768985 name=pool_workqueue
    ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=000000006a6cb484 name=radix_tree_node

We can use it to understand what the RCU core is going to free. For example, some users may be interested in when the RCU core starts freeing reclaimable slabs like dentry to reduce memory pressure.

Link: https://lkml.kernel.org/r/20201216072804.8838-1-jian.w.wen@oracle.com Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e7f37f2 |
|
07-Dec-2020 |
Paul E. McKenney <paulmck@kernel.org> |
mm: Add mem_dump_obj() to print source of memory block There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. 
Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Reported-by: Andrii Nakryiko <andrii@kernel.org> [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ] [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ] [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ] [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ] Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
3286222f |
|
09-Feb-2021 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: better heuristic for number of cpus when calculating slab order When creating a new kmem cache, SLUB determines how large the slab pages will be based on a number of inputs, including the number of CPUs in the system. Larger slab pages mean that more objects can be allocated/freed from per-cpu slabs before accessing shared structures, but also potentially more memory can be wasted due to low slab usage and fragmentation. The rough idea of using number of CPUs is that larger systems will be more likely to benefit from reduced contention, and also should have enough memory to spare. Number of CPUs used to be determined as nr_cpu_ids, which is number of possible cpus, but on some systems many will never be onlined, thus commit 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order") changed it to num_online_cpus(). However, for kmem caches created early before CPUs are onlined, this may lead to permanently low slab page sizes. Vincent reports a regression [1] of hackbench on arm64 systems: "I'm facing significant performances regression on a large arm64 server system (224 CPUs). Regressions is also present on small arm64 system (8 CPUs) but in a far smaller order of magnitude On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16 v5.11-rc4 : 9.135sec (+/- 0.45%) v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%) v5.10: 3.136sec (+/- 0.40%)" Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting page allocator contention: "i.e. the patch incurs a 7% to 32% performance penalty. This bisected cleanly yesterday when I was looking for the regression and then found the thread. Numerous caches change size. For example, kmalloc-512 goes from order-0 (vanilla) to order-2 with the revert. So mostly this is down to the number of times SLUB calls into the page allocator which only caches order-0 pages on a per-cpu basis" Clearly num_online_cpus() doesn't work too early in bootup.
We could change the order dynamically in a memory hotplug callback, but runtime order changing for existing kmem caches has been already shown as dangerous, and removed in 32a6f409b693 ("mm, slub: remove runtime allocation order changes"). It could be resurrected in a safe manner with some effort, but to fix the regression we need something simpler. We could use num_present_cpus() that should be the number of physically present CPUs even before they are onlined. That would work for PowerPC [3], which triggered the original commit, but that still doesn't work on arm64 [4] as explained in [5]. So this patch tries to determine the best available value without specific arch knowledge. - num_present_cpus() if the number is larger than 1, as that means the arch is likely setting it properly - nr_cpu_ids otherwise This should fix the reported regressions while also keeping the effect of 045ab8c9487b for PowerPC systems. It's possible there are configurations where num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time bloated, so these (if they exist) would keep the large orders based on nr_cpu_ids as was before 045ab8c9487b. 
[1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/ [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/ [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/ Link: https://lkml.kernel.org/r/20210208134108.22286-1-vbabka@suse.cz Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Reported-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
757fed1d |
|
28-Jan-2021 |
Wang Hai <wanghai38@huawei.com> |
Revert "mm/slub: fix a memory leak in sysfs_slab_add()"

This reverts commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2.

syzbot reported a double-free bug. The following case can cause this bug.

- mm/slab_common.c: create_cache(): if the __kmem_cache_create() fails, it does:

    out_free_cache:
        kmem_cache_free(kmem_cache, s);

- but __kmem_cache_create() - at least for slub() - will have done

    sysfs_slab_add(s)
        -> sysfs_create_group() .. fails ..
        -> kobject_del(&s->kobj); .. which frees s ...

We can't remove the kmem_cache_free() in create_cache(), because other error cases of __kmem_cache_create() do not free this. So, revert the commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") to fix this.

Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com Fixes: dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Wang Hai <wanghai38@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce5716c6 |
|
23-Jan-2021 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, mm: fix conflicts with init_on_alloc/free A few places where SLUB accesses object's data or metadata were missed in a previous patch. This leads to false positives with hardware tag-based KASAN when bulk allocations are used with init_on_alloc/free. Fix the false-positives by resetting pointer tags during these accesses. (The kasan_reset_tag call is removed from slab_alloc_node, as it's added into maybe_wipe_obj_freeptr.) Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901 Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8ff60eb0 |
|
12-Jan-2021 |
Jann Horn <jannh@google.com> |
mm, slub: consider rest of partial list if acquire_slab() fails acquire_slab() fails if there is contention on the freelist of the page (probably because some other CPU is concurrently freeing an object from the page). In that case, it might make sense to look for a different page (since there might be more remote frees to the page from other CPUs, and we don't want contention on struct page). However, the current code accidentally stops looking at the partial list completely in that case. Especially on kernels without CONFIG_NUMA set, this means that get_partial() fails and new_slab_objects() falls back to new_slab(), allocating new pages. This could lead to an unnecessary increase in memory fragmentation. Link: https://lkml.kernel.org/r/20201228130853.1871516-1-jannh@google.com Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f3147b4 |
|
29-Dec-2020 |
Roman Gushchin <guro@fb.com> |
mm: slub: call account_slab_page() after slab page initialization It's convenient to have page->objects initialized before calling into account_slab_page(). In particular, this information can be used to pre-alloc the obj_cgroup vector. Let's call account_slab_page() a bit later, after the initialization of page->objects. This commit doesn't bring any functional change, but is required for further optimizations. [akpm@linux-foundation.org: undo changes needed by forthcoming mm-memcg-slab-pre-allocate-obj_cgroups-for-slab-caches-with-slab_account.patch] Link: https://lkml.kernel.org/r/20201110195753.530157-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aa1ef4d7 |
|
22-Dec-2020 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, mm: reset tags when accessing metadata Kernel allocator code accesses metadata for slab objects, that may lie out-of-bounds of the object itself, or be accessed when an object is freed. Such accesses trigger tag faults and lead to false-positive reports with hardware tag-based KASAN. Software KASAN modes disable instrumentation for allocator code via KASAN_SANITIZE Makefile macro, and rely on kasan_enable/disable_current() annotations which are used to ignore KASAN reports. With hardware tag-based KASAN neither of those options is available, as it doesn't use compiler instrumentation, no tag faults are ignored, and MTE is disabled after the first one. Instead, reset tags when accessing metadata (currently only for SLUB). Link: https://lkml.kernel.org/r/a0f3cefbc49f34c843b664110842de4db28179d0.1606161801.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bf16d19a |
|
14-Dec-2020 |
Joe Perches <joe@perches.com> |
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at Convert the unbounded uses of sprintf to sysfs_emit. A few conversions may now not end in a newline if the output buffer is overflowed. Link: https://lkml.kernel.org/r/0c90a90f466167f8c37de4b737553cf49c4a277f.1605376435.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
045ab8c9 |
|
14-Dec-2020 |
Bharata B Rao <bharata@linux.ibm.com> |
mm/slub: let number of online CPUs determine the slub page order The page order of the slab that gets chosen for a given slab cache depends on the number of objects that can be fit in the slab while meeting other requirements. We start with a value of minimum objects based on nr_cpu_ids that is driven by possible number of CPUs and hence could be higher than the actual number of CPUs present in the system. This leads to calculate_order() choosing a page order that is on the higher side leading to increased slab memory consumption on systems that have bigger page sizes. Hence rely on the number of online CPUs when determining the minimum objects, thereby increasing the chances of choosing a lower conservative page order for the slab. Vlastimil said: "Ideally, we would react to hotplug events and update existing caches accordingly. But for that, recalculation of order for existing caches would have to be made safe, while not affecting hot paths. We have removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation order changes") as it didn't seem easy and worth the trouble. In case somebody wants to start with a large order right from the boot because they know they will hotplug lots of cpus later, they can use slub_min_objects= boot param to override this heuristic.
So in case this change regresses somebody's performance, there's a way around it and thus the risk is low IMHO" Link: https://lkml.kernel.org/r/20201118082759.1413056-1-bharata@linux.ibm.com Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
965c4848 |
|
14-Dec-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: use kmem_cache_debug_flags() in deactivate_slab() Commit 9cf7a1118365 ("mm/slub: make add_full() condition more explicit") replaced an unnecessarily generic kmem_cache_debug(s) check with an explicit check of SLAB_STORE_USER and #ifdef CONFIG_SLUB_DEBUG. We can achieve the same specific check with the recently added kmem_cache_debug_flags() which removes the #ifdef and restores the no-branch-overhead benefit of static key check when slub debugging is not enabled. Link: https://lkml.kernel.org/r/3ef24214-38c7-1238-8296-88caf7f48ab6@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Abel Wu <wuyun.wu@huawei.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Liu Xiang <liu.xiang6@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0c06dd75 |
|
14-Dec-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab, slub: clear the slab_cache field when freeing page The page allocator expects that page->mapping is NULL for a page being freed. SLAB and SLUB use the slab_cache field which is in union with mapping, but before freeing the page, the field is referenced with the "mapping" name when set to NULL. It's IMHO more correct (albeit functionally the same) to use the slab_cache name as that's the field we use in SL*B, and document why we clear it in a comment (we don't clear fields such as s_mem or freelist, as page allocator doesn't care about those). While using the 'mapping' name would automagically keep the code correct if the unions in struct page changed, such changes should be done consciously and needed changes evaluated - the comment should help with that. Link: https://lkml.kernel.org/r/20201210160020.21562-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
22e4663e |
|
13-Nov-2020 |
Laurent Dufour <ldufour@linux.ibm.com> |
mm/slub: fix panic in slab_alloc_node() While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs with 11TB of ram, I hit the following panic: BUG: Kernel NULL pointer dereference on read at 0x00000007 Faulting instruction address: 0xc000000000456048 Oops: Kernel access of bad area, sig: 11 [#2] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1 NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350 REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004228 XER: 00000000 CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0 GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000 GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320 GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000 GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1 GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8 GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000 GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001 NIP [c000000000456048] __kmalloc_node+0x108/0x790 LR [c000000000455fd4] __kmalloc_node+0x94/0x790 Call Trace: kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x10c/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2c4/0x470 cgroup_mkdir+0x408/0x5f0 kernfs_iop_mkdir+0x90/0x100 vfs_mkdir+0x138/0x250 do_mkdirat+0x154/0x1c0 system_call_exception+0xf8/0x200 system_call_common+0xf0/0x27c Instruction dump: e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a 2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78 This pointing to the following code: mm/slub.c:2851 if (unlikely(!object || !node_match(page, node))) { 
c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0 c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054 <__kmalloc_node+0x114> node_match(): mm/slub.c:2491 if (node != NUMA_NO_NODE && page_to_nid(page) != node) c000000000456040: 30 02 92 41 beq cr4,c000000000456270 <__kmalloc_node+0x330> page_to_nid(): include/linux/mm.h:1294 c000000000456044: 10 00 27 e9 ld r9,16(r7) c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL node_match(): mm/slub.c:2491 c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9 c000000000456050: 20 02 9e 41 beq cr7,c000000000456270 <__kmalloc_node+0x330> The panic occurred in slab_alloc_node() when checking for the page's node: object = c->freelist; page = c->page; if (unlikely(!object || !node_match(page, node))) { object = __slab_alloc(s, gfpflags, node, addr, c); stat(s, ALLOC_SLOWPATH); The issue is that object is not NULL while page is NULL which is odd but may happen if the cache flush happened after loading object but before loading page. Thus checking for the page pointer is required too. The cache flush is done through an inter processor interrupt when a piece of memory is off-lined. That interrupt is triggered when a memory hot-unplug operation is initiated and offline_pages() is calling the slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback() which is calling flush_cpu_slab(). If that interrupt is caught between the reading of c->freelist and the reading of c->page, this could lead to such a situation. That situation is expected and the later call to this_cpu_cmpxchg_double() will detect the change to c->freelist and redo the whole operation. In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()") check on the page pointer has been removed assuming that page is always valid when it is called. It happens that this is not true in that particular case, so check for page before calling node_match() here. 
Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
70b6d25e |
|
15-Oct-2020 |
Chen Tao <chentao3@hotmail.com> |
mm: fix some comments formatting Correct one function name "get_partials" with "get_partial". Update the old struct name of list3 with kmem_cache_node. Signed-off-by: Chen Tao <chentao3@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lkml.kernel.org/r/Message-ID: Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d1b2cf6c |
|
13-Oct-2020 |
Bharata B Rao <bharata@linux.ibm.com> |
mm: memcg/slab: uncharge during kmem_cache_free_bulk() Object cgroup charging is done for all the objects during allocation, but during freeing, uncharging ends up happening for only one object in the case of bulk allocation/freeing. Fix this by having a separate call to uncharge all the objects from kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take care of bulk uncharging. Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects") Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9cf7a111 |
|
13-Oct-2020 |
Abel Wu <wuyun.wu@huawei.com> |
mm/slub: make add_full() condition more explicit The commit below is incomplete, as it didn't handle the add_full() part. commit a4d3f8916c65 ("slub: remove useless kmem_cache_debug() before remove_full()") This patch checks for SLAB_STORE_USER instead of kmem_cache_debug(), since that should be the only context in which we need the list_lock for add_full(). Signed-off-by: Abel Wu <wuyun.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Liu Xiang <liu.xiang6@zte.com.cn> Link: https://lkml.kernel.org/r/20200811020240.1231-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f986d99 |
|
13-Oct-2020 |
Abel Wu <wuyun.wu@huawei.com> |
mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc The ALLOC_SLOWPATH statistics is missing in bulk allocation now. Fix it by doing statistics in alloc slow path. Signed-off-by: Abel Wu <wuyun.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hewenliang <hewenliang4@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c270cf30 |
|
13-Oct-2020 |
Abel Wu <wuyun.wu@huawei.com> |
mm/slub.c: branch optimization in free slowpath The two conditions are mutually exclusive and the gcc compiler will optimise this into an if-else-like pattern. Given that the majority of free_slowpath is free_frozen, let's provide some hint to the compilers. Tests (perf bench sched messaging -g 20 -l 400000, executed 10x after reboot) are done and the summarized result (un-patched vs. patched) is: max. 192.316 vs. 189.851; min. 187.267 vs. 186.252; avg. 189.154 vs. 188.086; stdev. 1.37 vs. 0.99. Signed-off-by: Abel Wu <wuyun.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hewenliang <hewenliang4@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
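The kind of compiler hint this entry describes can be illustrated with a small userspace sketch. It is not the patch itself: it assumes GCC or Clang (for `__builtin_expect`), and the enum and the function `classify_free()` are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch hints built on a compiler builtin; GCC or Clang is assumed. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical outcome labels, used only in this sketch. */
enum free_outcome { FREE_WAS_FROZEN, FREE_NEWLY_FROZEN, FREE_OTHER };

/*
 * Two mutually exclusive conditions written as an explicit if/else chain.
 * The case assumed to be the common one (slab already frozen) is marked
 * as likely so the compiler can lay out that path first.
 */
static enum free_outcome classify_free(bool was_frozen, bool now_frozen)
{
	if (likely(was_frozen))
		return FREE_WAS_FROZEN;
	else if (now_frozen)
		return FREE_NEWLY_FROZEN;
	return FREE_OTHER;
}

/* Small self-check that documents the expected classification. */
static void classify_free_selfcheck(void)
{
	assert(classify_free(true, true) == FREE_WAS_FROZEN);
	assert(classify_free(false, true) == FREE_NEWLY_FROZEN);
	assert(classify_free(false, false) == FREE_OTHER);
}
```

The hint changes only code layout and branch prediction defaults, never the result, so a wrong guess costs performance but not correctness.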
|
#
484cfaca |
|
02-Oct-2020 |
Eric Farman <farman@linux.ibm.com> |
mm, slub: restore initial kmem_cache flags The routine that applies debug flags to the kmem_cache slabs inadvertently prevents non-debug flags from being applied to those same objects. That is, if slub_debug=<flag>,<slab> is specified, non-debugged slabs will end up having flags of zero, and the slabs may be unusable. Fix this by including the input flags for non-matching slabs with the contents of slub_debug, so that the caches are created as expected alongside any debugging options that may be requested. With this, we can remove the check for a NULL slub_debug_string, since it's covered by the loop itself. Fixes: e17f1dfba37b ("mm, slub: extend slub_debug syntax for multiple blocks") Signed-off-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: https://lkml.kernel.org/r/20200930161931.28575-1-farman@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dc07a728 |
|
04-Sep-2020 |
Eugeniu Rosca <erosca@de.adit-jv.com> |
mm: slub: fix conversion of freelist_corrupted() Commit 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()") suffered an update when picked up from LKML [1]. Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()' created a no-op statement. Fix it by sticking to the behavior intended in the original patch [1]. In addition, make freelist_corrupted() immune to passing NULL instead of &freelist. The issue has been spotted via static analysis and code review. [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/ Fixes: 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()") Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
74d555be |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() charge_slab_page() and uncharge_slab_page() are not related anymore to memcg charging and uncharging. In order to make their names less confusing, let's rename them to account_slab_page() and unaccount_slab_page() respectively. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
84950480 |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: remove unused argument by charge_slab_page() charge_slab_page() is not using the gfp argument anymore, remove it. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
10befea9 |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: use a single set of kmem_caches for all allocations Instead of having two sets of kmem_caches: one for system-wide and non-accounted allocations and the second one shared by all accounted allocations, we can use just one. The idea is simple: space for obj_cgroup metadata can be allocated on demand and filled only for accounted allocations. It allows to remove a bunch of code which is required to handle kmem_cache clones for accounted allocations. There is no more need to create them, accumulate statistics, propagate attributes, etc. It's a quite significant simplification. Also, because the total number of slab_caches is reduced almost twice (not all kmem_caches have a memcg clone), some additional memory savings are expected. On my devvm it additionally saves about 3.5% of slab memory. [guro@fb.com: fix build on MIPS] Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c7094406 |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: deprecate slab_root_caches Currently there are two lists of kmem_caches: 1) slab_caches, which contains all kmem_caches, 2) slab_root_caches, which contains only root kmem_caches. And there is some preprocessor magic to have a single list if CONFIG_MEMCG_KMEM isn't enabled. It was required earlier because the number of non-root kmem_caches was proportional to the number of memory cgroups and could reach really big values. Now, when it cannot exceed the number of root kmem_caches, there is really no reason to maintain two lists. We never iterate over the slab_root_caches list on any hot paths, so it's perfectly fine to iterate over slab_caches and filter out non-root kmem_caches. It allows to remove a lot of config-dependent code and two pointers from the kmem_cache structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9855609b |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: use a single set of kmem_caches for all accounted allocations This is a fairly big but mostly red patch, which makes all accounted slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup. Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be perfectly destroyed together with their root counterparts. This allows to dramatically simplify the management of non-root kmem_caches and delete a ton of code. This patch performs the following changes: 1) introduces memcg_params.memcg_cache pointer to represent the kmem_cache which will be used for all non-root allocations 2) reuses the existing memcg kmem_cache creation mechanism to create memcg kmem_cache on the first allocation attempt 3) memcg kmem_caches are named <kmemcache_name>-memcg, e.g. dentry-memcg 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache or schedule its creation and return the root cache 5) removes almost all non-root kmem_cache management code (separate refcounter, reparenting, shrinking, etc) 6) makes slab debugfs display root_mem_cgroup css id and never show :dead and :deact flags in the memcg_slabinfo attribute. Following patches in the series will simplify the kmem_cache creation. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
964d4bd3 |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: save obj_cgroup for non-root slab objects Store the obj_cgroup pointer in the corresponding place of page->obj_cgroups for each allocated non-root slab object. Make sure that each allocated object holds a reference to obj_cgroup. Objcg pointer is obtained from the memcg->objcg dereferencing in memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook. Then in case of successful allocation(s) it's getting stored in the page->obj_cgroups vector. The objcg obtaining part looks a bit bulky now, but it will be simplified by next commits in the series. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4138fdfc |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: slub: implement SLUB version of obj_to_index() This commit implements SLUB version of the obj_to_index() function, which will be required to calculate the offset of obj_cgroup in the obj_cgroups vector to store/obtain the objcg ownership data. To make it faster, let's repeat the SLAB's trick introduced by commit 6a2d7a955d8d ("SLAB: use a multiply instead of a divide in obj_to_index()") and avoid an expensive division. Vlastimil Babka noticed, that SLUB does have already a similar function called slab_index(), which is defined only if SLUB_DEBUG is enabled. The function does a similar math, but with a division, and it also takes a page address instead of a page pointer. Let's remove slab_index() and replace it with the new helper __obj_to_index(), which takes a page address. obj_to_index() will be a simple wrapper taking a page pointer and passing page_address(page) into __obj_to_index(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
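The "multiply instead of a divide" trick mentioned above can be sketched in userspace C. This is a minimal sketch of reciprocal division as commonly implemented (precompute a magic multiplier and two shifts for a fixed divisor), not the code of this commit; the names `struct recip`, `recip_value()`, `recip_divide()` and `obj_index()` are hypothetical and chosen for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed constants for dividing by one fixed 32-bit divisor. */
struct recip {
	uint32_t m;       /* magic multiplier */
	uint8_t sh1, sh2; /* correction shifts */
};

/* 1-based position of the highest set bit; 0 when x == 0. */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* One-time setup per divisor (for a slab cache: the object size). */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = fls32(d - 1);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);
	return r;
}

/* a / d using one 64-bit multiply and shifts, with no division. */
static uint32_t recip_divide(uint32_t a, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Index of the object that starts `offset` bytes into its slab. */
static uint32_t obj_index(uint32_t offset, struct recip size_recip)
{
	return recip_divide(offset, size_recip);
}
```

The expensive division happens once in `recip_value()`, when the divisor becomes known; every later lookup pays only for a multiply, a subtraction, an addition and two shifts.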
|
#
d42f3245 |
|
07-Aug-2020 |
Roman Gushchin <guro@fb.com> |
mm: memcg: convert vmstat slab counters to bytes In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cfbe1636 |
|
07-Aug-2020 |
Marco Elver <elver@google.com> |
mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS" Provide the necessary KCSAN checks to assist with debugging racy use-after-frees. While KASAN is more reliable at generally catching such use-after-frees (due to its use of a quarantine), it can be difficult to debug racy use-after-frees. If a reliable reproducer exists, KCSAN can assist in debugging such issues. Note: ASSERT_EXCLUSIVE_ACCESS is a convenience wrapper if the size is simply sizeof(var). Instead, here we just use __kcsan_check_access() explicitly to pass the correct size. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200623072653.114563-1-elver@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b3cb9fc3 |
|
07-Aug-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
mm/slub.c: drop lockdep_assert_held() from put_map() There is no point in using lockdep_assert_held() on a lock that is about to be unlocked. It works only with lockdep and lockdep will complain if spin_unlock() is used on a lock that has not been locked. Remove superfluous lockdep_assert_held(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200618201234.795692-2-bigeasy@linutronix.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e42f174e |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab/slub: improve error reporting and overhead of cache_from_obj() cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get the cache from its page in kmem_cache_free()") to support kmemcg, where per-memcg cache can be different from the root one, so we can't use the kmem_cache pointer given to kmem_cache_free(). Prior to that commit, SLUB already had debugging check+warning that could be enabled to compare the given kmem_cache pointer to one referenced by the slab page where the object-to-be-freed resides. This check was moved to cache_from_obj(). Later the check was also enabled for SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate cache membership under freelist hardening"). These checks and warnings can be useful especially for the debugging, which can be improved. Commit 598a0717a816 changed the pr_err() with WARN_ON_ONCE() to WARN_ONCE() so only the first hit is now reported, others are silent. This patch changes it to WARN() so that all errors are reported. It's also useful to print SLUB allocation/free tracking info for the offending object, if tracking is enabled. Thus, export the SLUB print_tracking() function and provide an empty one for SLAB. For SLUB we can also benefit from the static key check in kmem_cache_debug_flags(), but we need to move this function to slab.h and declare the static key there. 
[1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com [vbabka@suse.cz: avoid bogus WARN()] Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Garrett <mjg59@google.com> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d3c58f24 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab/slub: move and improve cache_from_obj() The function cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get the cache from its page in kmem_cache_free()") to support kmemcg, where per-memcg cache can be different from the root one, so we can't use the kmem_cache pointer given to kmem_cache_free(). Prior to that commit, SLUB already had debugging check+warning that could be enabled to compare the given kmem_cache pointer to one referenced by the slab page where the object-to-be-freed resides. This check was moved to cache_from_obj(). Later the check was also enabled for SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate cache membership under freelist hardening"). These checks and warnings can be useful especially for the debugging, which can be improved. Commit 598a0717a816 changed the pr_err() with WARN_ON_ONCE() to WARN_ONCE() so only the first hit is now reported, others are silent. This patch changes it to WARN() so that all errors are reported. It's also useful to print SLUB allocation/free tracking info for the offending object, if tracking is enabled. We could export the SLUB print_tracking() function and provide an empty one for SLAB, or realize that both the debugging and hardening cases in cache_from_obj() are only supported by SLUB anyway. So this patch moves cache_from_obj() from slab.h to separate instances in slab.c and slub.c, where the SLAB version only does the kmemcg lookup and even could be completely removed once the kmemcg rework [1] is merged. The SLUB version can thus easily use the print_tracking() function. It can also use the kmem_cache_debug_flags() static key check for improved performance in kernels without the hardening and with debugging not enabled on boot. 
[1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8fc8d666 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: extend checks guarded by slub_debug static key There are few more places in SLUB that could benefit from reduced overhead of the static key introduced by a previous patch: - setup_object_debug() called on each object in newly allocated slab page - setup_page_debug() called on newly allocated slab page - __free_slab() called on freed slab page Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-9-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
59052e89 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: introduce kmem_cache_debug_flags() There are few places that call kmem_cache_debug(s) (which tests if any of debug flags are enabled for a cache) immediately followed by a test for a specific flag. The compiler can probably eliminate the extra check, but we can make the code nicer by introducing kmem_cache_debug_flags() that works like kmem_cache_debug() (including the static key check) but tests for specific flag(s). The next patches will add more users. [vbabka@suse.cz: change return from int to bool, per Kees. Add VM_WARN_ON_ONCE() for invalid flags, per Roman] Link: http://lkml.kernel.org/r/949b90ed-e0f0-07d7-4d21-e30ec0958a7c@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-8-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
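The idea of a helper that combines a global "is any debugging active" check with a test for specific flags can be sketched as follows. This is a userspace approximation under stated assumptions: a plain `bool` stands in for the static key (a real static key patches the branch in the instruction stream instead of loading a variable), and the flag values, `struct toy_cache` and `toy_cache_debug_flags()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; the kernel's SLAB_* values differ. */
#define DBG_RED_ZONE   0x1u
#define DBG_POISON     0x2u
#define DBG_STORE_USER 0x4u
#define DBG_ALL        (DBG_RED_ZONE | DBG_POISON | DBG_STORE_USER)

struct toy_cache {
	unsigned int flags;
};

/* Stand-in for the static key: false unless debugging was requested at boot. */
static bool toy_debug_enabled;

/*
 * Returns true if any of the requested debug flags is set on the cache.
 * The global check comes first, so when no cache has debugging enabled
 * the per-cache flags are never inspected.
 */
static inline bool toy_cache_debug_flags(const struct toy_cache *s,
					 unsigned int flags)
{
	/* Reject bits that are not debug flags in this sketch. */
	assert((flags & ~DBG_ALL) == 0);
	return toy_debug_enabled && (s->flags & flags) != 0;
}
```

A caller that previously tested "any debug flag set" and then a specific flag can use the single helper, for example `toy_cache_debug_flags(s, DBG_STORE_USER)`.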
|
#
ca0cab65 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: introduce static key for slub_debug() One advantage of CONFIG_SLUB_DEBUG is that a generic distro kernel can be built with the option enabled, but it's inactive until simply enabled on boot, without rebuilding the kernel. With a static key, we can further eliminate the overhead of checking whether a cache has a particular debug flag enabled if we know that there are no such caches (slub_debug was not enabled during boot). We use the same mechanism also for e.g. page_owner, debug_pagealloc or kmemcg functionality. This patch introduces the static key and makes the general check for per-cache debug flags kmem_cache_debug() use it. This benefits several call sites, including (slow path but still rather frequent) __slab_free(). The next patches will add more uses. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-7-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8f58119a |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: make reclaim_account attribute read-only The attribute reflects the SLAB_RECLAIM_ACCOUNT cache flag. It's not clear why this attribute was writable in the first place, as it's tied to how the cache is used by its creator, it's not a user tunable. Furthermore: - it affects slab merging, but that's not being checked while toggled - it affects whether __GFP_RECLAIMABLE flag is used to allocate page, but the runtime toggle doesn't update allocflags - it affects cache_vmstat_idx() so runtime toggling might lead to inconsistency of NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE Thus make it read-only. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-6-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
060807f8 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: make remaining slub_debug related attributes read-only SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can be read to check if the respective debugging options are enabled for given cache. Some options, namely sanity_checks, trace, and failslab can be also enabled and disabled at runtime by writing into the files. The runtime toggling is racy. Some options disable __CMPXCHG_DOUBLE when enabled, which means that in case of concurrent allocations, some can still use __CMPXCHG_DOUBLE and some not, leading to potential corruption. The s->flags field is also not updated or checked atomically. The simplest solution is to remove the runtime toggling. The extended slub_debug boot parameter syntax introduced by earlier patch should allow to fine-tune the debugging configuration during boot with same granularity. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
32a6f409 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: remove runtime allocation order changes SLUB allows runtime changing of page allocation order by writing into the /sys/kernel/slab/<cache>/order file. Jann has reported [1] that this interface allows the order to be set too small, leading to crashes. While it's possible to fix the immediate issue, closer inspection reveals potential races. Storing the new order calls calculate_sizes() which non-atomically updates a lot of kmem_cache fields while the cache is still in use. Unexpected behavior might occur even if the fields are set to the same value as they were. This could be fixed by splitting out the part of calculate_sizes() that depends on forced_order, so that we only update kmem_cache.oo field. This could still race with init_cache_random_seq(), shuffle_freelist(), allocate_slab(). Perhaps it's possible to audit and e.g. add some READ_ONCE/WRITE_ONCE accesses, it might be easier just to remove the runtime order changes, which is what this patch does. If there are valid usecases for per-cache order setting, we could e.g. extend the boot parameters to do that. [1] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-4-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ad38b5b1 |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: make some slub_debug related attributes read-only SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can be read to check if the respective debugging options are enabled for given cache. The options can be also toggled at runtime by writing into the files. Some of those, namely red_zone, poison, and store_user can be toggled only when no objects yet exist in the cache. Vijayanand reports [1] that there is a problem with freelist randomization if changing the debugging option's state results in different number of objects per page, and the random sequence cache thus needs to be recomputed. However, another problem is that the check for "no objects yet exist in the cache" is racy, as noted by Jann [2] and fixing that would add overhead or otherwise complicate the allocation/freeing paths. Thus it would be much simpler just to remove the runtime toggling support. The documentation describes it's "In case you forgot to enable debugging on the kernel command line", but the necessity of having no objects limits its usefulness anyway for many caches. Vijayanand describes a use case [3] where debugging is enabled for all but zram caches for memory overhead reasons, and using the runtime toggles was the only way to achieve such configuration. After the previous patch it's now possible to do that directly from the kernel boot option, so we can remove the dangerous runtime toggles by making the /sys attribute files read-only. While updating it, also improve the documentation of the debugging /sys files. 
[1] https://lkml.kernel.org/r/1580379523-32272-1-git-send-email-vjitta@codeaurora.org [2] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com [3] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org Reported-by: Vijayanand Jitta <vjitta@codeaurora.org> Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200610163135.17364-3-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e17f1dfb |
|
07-Aug-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: extend slub_debug syntax for multiple blocks Patch series "slub_debug fixes and improvements". The slub_debug kernel boot parameter can either apply a single set of options to all caches or a list of caches. There is a use case where debugging is applied for all caches and then disabled at runtime for specific caches, for performance and memory consumption reasons [1]. As runtime changes are dangerous, extend the boot parameter syntax so that multiple blocks of either global or slab-specific options can be specified, with blocks delimited by ';'. This will also support the use case of [1] without runtime changes. For details see the updated Documentation/vm/slub.rst [1] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org [weiyongjun1@huawei.com: make parse_slub_debug_flags() static] Link: http://lkml.kernel.org/r/20200702150522.4940-1-weiyongjun1@huawei.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Jann Horn <jannh@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200610163135.17364-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
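The block syntax described above can be illustrated with a small user-space sketch. This is not the kernel's parse_slub_debug_flags(); the struct debug_block type and the split_blocks() helper are hypothetical names introduced only for illustration. It assumes each ';'-delimited block consists of option letters optionally followed by ',' and a comma-separated slab list.

```c
#include <stdio.h>
#include <string.h>

/* One parsed block: option letters plus an optional slab list. */
struct debug_block {
	char flags[16];	/* e.g. "FZ" or "-" */
	char slabs[64];	/* e.g. "zs_handle,zspage"; empty means all caches */
};

/*
 * Split a slub_debug-style value into ';'-delimited blocks.
 * Hypothetical helper: returns the number of blocks stored in out[].
 */
static int split_blocks(const char *value, struct debug_block *out, int max)
{
	int n = 0;

	while (*value && n < max) {
		size_t len = strcspn(value, ";");	/* length of this block */
		const char *comma = memchr(value, ',', len);
		size_t flen = comma ? (size_t)(comma - value) : len;
		size_t slen = comma ? len - flen - 1 : 0;

		if (len > 0) {
			/* snprintf() truncates safely if a field is too long. */
			snprintf(out[n].flags, sizeof(out[n].flags),
				 "%.*s", (int)flen, value);
			snprintf(out[n].slabs, sizeof(out[n].slabs),
				 "%.*s", (int)slen, comma ? comma + 1 : "");
			n++;
		}
		value += len;
		if (*value == ';')
			value++;	/* skip the block delimiter */
	}
	return n;
}
```

With a value such as "FZ;-,zs_handle,zspage", the sketch yields one global block ("FZ") and one slab-specific block that switches debugging off ("-") for the two listed caches, which matches the zram use case referenced in [1].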
|
#
44405099 |
|
07-Aug-2020 |
Long Li <lonuxli.64@gmail.com> |
mm, slab: check GFP_SLAB_BUG_MASK before alloc_pages in kmalloc_order kmalloc cannot allocate memory from HIGHMEM. Allocating large amounts of memory currently bypasses the check and will simply leak the memory when page_address() returns NULL. To fix this, factor the GFP_SLAB_BUG_MASK check out of slab & slub, and call it from kmalloc_order() as well. In order to make the code clear, the warning message is put in one place. Signed-off-by: Long Li <lonuxli.64@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200704035027.GA62481@lilong Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3f649ab7 |
|
03-Jun-2020 |
Kees Cook <keescook@chromium.org> |
treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
55860d96 |
|
25-Jun-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
slub: cure list_slab_objects() from double fix According to Christopher Lameter two fixes have been merged for the same problem. As far as I can tell, the code does not acquire the list_lock and invoke kmalloc(). list_slab_objects() misses an unlock (the counterpart to get_map()) and the memory allocated in free_partial() isn't used. Revert the mentioned commit. Link: http://lkml.kernel.org/r/20200618201234.795692-1-bigeasy@linutronix.de Fixes: aa456c7aebb14 ("slub: remove kmalloc under list_lock from list_slab_objects() V2") Link: https://lkml.kernel.org/r/alpine.DEB.2.22.394.2006181501480.12014@www.lameter.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fe557319 |
|
17-Jun-2020 |
Christoph Hellwig <hch@lst.de> |
maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0d645ed1 |
|
04-Jun-2020 |
Ethon Paul <ethp@qq.com> |
mm/slub: fix a typo in comment "disambiguiation"->"disambiguation" There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200411002247.14468-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
97a225e6 |
|
03-Jun-2020 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just a different name for high_zoneidx now. So, integrate them and add a comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx(), is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it conveys the meaning more precisely. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dde3c6b7 |
|
03-Jun-2020 |
Wang Hai <wanghai38@huawei.com> |
mm/slub: fix a memory leak in sysfs_slab_add() syzkaller reports a memory leak when kobject_init_and_add() returns an error in the function sysfs_slab_add() [1]. When this happens, the function kobject_put() is not called for the corresponding kobject, which potentially leads to a memory leak. This patch fixes the issue by calling kobject_put() even if kobject_init_and_add() fails. [1] BUG: memory leak unreferenced object 0xffff8880a6d4be88 (size 8): comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s) hex dump (first 8 bytes): 70 69 64 5f 33 00 ff ff pid_3... backtrace: kstrdup+0x35/0x70 mm/util.c:60 kstrdup_const+0x3d/0x50 mm/util.c:82 kvasprintf_const+0x112/0x170 lib/kasprintf.c:48 kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xd8/0x170 lib/kobject.c:473 sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811 __kmem_cache_create+0x50a/0x570 mm/slub.c:4384 create_cache+0x113/0x1e0 mm/slab_common.c:407 kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505 kmem_cache_create+0xd/0x10 mm/slab_common.c:564 create_pid_cachep kernel/pid_namespace.c:54 [inline] create_pid_namespace kernel/pid_namespace.c:96 [inline] copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148 create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95 unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229 ksys_unshare+0x3d2/0x770 kernel/fork.c:2969 __do_sys_unshare kernel/fork.c:3037 [inline] __se_sys_unshare kernel/fork.c:3035 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035 do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295 Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a68ee057 |
|
01-Jun-2020 |
Qian Cai <cai@lca.pw> |
mm/slub: fix stack overruns with SLUB_STATS There is no need to copy SLUB_STATS items from root memcg cache to new memcg cache copies. Doing so could result in stack overruns because the store function only accepts 0 to clear the stat and returns an error for everything else while the show method would print out the whole stat. Then, the mismatch of the lengths returned by the show and store methods happens in memcg_propagate_slab_attrs(): else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf)) buf = mbuf; max_attr_size is only 2 from slab_attr_store(), then, it uses mbuf[64] in show_stat() later where a bunch of sprintf() would overrun the stack variable. Fix it by always allocating a page of buffer to be used in show_stat() if SLUB_STATS=y which should only be used for debug purpose. # echo 1 > /sys/kernel/slab/fs_cache/shrink BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0 Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: number+0x421/0x6e0 vsnprintf+0x451/0x8e0 sprintf+0x9e/0xd0 show_stat+0x124/0x1d0 alloc_slowpath_show+0x13/0x20 __kmem_cache_create+0x47a/0x6b0 addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame: process_one_work+0x0/0xb90 this frame has 1 object: [32, 72) 'lockdep_map' Memory state around the buggy address: ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 ^ ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00 ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0 Workqueue: memcg_kmem_cache
memcg_kmem_cache_create_func Call Trace: __kmem_cache_create+0x6ac/0x6b0 Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Glauber Costa <glauber@scylladb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aa456c7a |
|
01-Jun-2020 |
Christopher Lameter <cl@linux.com> |
slub: remove kmalloc under list_lock from list_slab_objects() V2 list_slab_objects() is called when a slab is destroyed and there are objects still left to list the objects in the syslog. This is a pretty rare event. And there it seems we take the list_lock and call kmalloc while holding that lock. Perform the allocation in free_partial() before the list_lock is taken. Fixes: bbd7d57bfe852d9788bae5fb171c7edb4021d8ac ("slub: Potential stack overflow") Signed-off-by: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yu Zhao <yuzhao@google.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2002031721250.1668@www.lameter.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d7660ce5 |
|
01-Jun-2020 |
Christoph Lameter <cl@linux.com> |
slub: Remove userspace notifier for cache add/remove I came across some unnecessary uevents once again which reminded me of this. The patch seems to be lost in the leaves of the original discussion [1], so resending. [1] https://lore.kernel.org/r/alpine.DEB.2.21.2001281813130.745@www.lameter.com Kmem caches are internal kernel structures so it is strange that userspace notifiers would be needed. And I am not aware of any use of these notifiers. These notifiers may just exist because in the initial slub release the sysfs code was copied from another subsystem. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200423115721.19821-1-mkoutny@suse.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52f23478 |
|
01-Jun-2020 |
Dongli Zhang <dongli.zhang@oracle.com> |
mm/slub.c: fix corrupted freechain in deactivate_slab() The slub_debug is able to fix the corrupted slab freelist/page. However, alloc_debug_processing() only checks the validity of current and next freepointer during allocation path. As a result, once some objects have their freepointers corrupted, deactivate_slab() may lead to page fault. Below is from a test kernel module when 'slub_debug=PUF,kmalloc-128 slub_nomerge'. The test kernel corrupts the freepointer of one free object on purpose. Unfortunately, deactivate_slab() does not detect it when iterating the freechain. BUG: unable to handle page fault for address: 00000000123456f8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... ... RIP: 0010:deactivate_slab.isra.92+0xed/0x490 ... ... Call Trace: ___slab_alloc+0x536/0x570 __slab_alloc+0x17/0x30 __kmalloc+0x1d9/0x200 ext4_htree_store_dirent+0x30/0xf0 htree_dirblock_to_tree+0xcb/0x1c0 ext4_htree_fill_tree+0x1bc/0x2d0 ext4_readdir+0x54f/0x920 iterate_dir+0x88/0x190 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x49/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Therefore, this patch adds extra consistency check in deactivate_slab(). Once an object's freepointer is corrupted, all following objects starting at this object are isolated. [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n] Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joe Jin <joe.jin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cbfc35a4 |
|
07-May-2020 |
Waiman Long <longman@redhat.com> |
mm/slub: fix incorrect interpretation of s->offset In a couple of places in the slub memory allocator, the code uses "s->offset" as a check to see if the free pointer is put right after the object. That check is no longer true with commit 3202fa62fb43 ("slub: relocate freelist pointer to middle of object"). As a result, echoing "1" into the validate sysfs file, e.g. of dentry, may cause a bunch of "Freepointer corrupt" error reports like the following to appear with the system in panic afterwards. ============================================================================= BUG dentry(666:pmcd.service) (Tainted: G B): Freepointer corrupt ----------------------------------------------------------------------------- To fix it, use the check "s->offset == s->inuse" in the new helper function freeptr_outside_object() instead. Also add another helper function get_info_end() to return the end of info block (inuse + free pointer if not overlapping with object). Fixes: 3202fa62fb43 ("slub: relocate freelist pointer to middle of object") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: Changbin Du <changbin.du@gmail.com> Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
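The two helpers named above can be sketched as follows, going only by the description in this message. struct cache_layout is a hypothetical stand-in holding just the two kmem_cache fields discussed; the upstream helpers operate on struct kmem_cache and may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct kmem_cache fields involved. */
struct cache_layout {
	unsigned int inuse;	/* end of the in-use part of the object */
	unsigned int offset;	/* where the free pointer is stored */
};

/* True when the free pointer sits right after the object, not inside it. */
static bool freeptr_outside_object(const struct cache_layout *s)
{
	return s->offset == s->inuse;
}

/* End of the info block: inuse, plus the free pointer if it is outside. */
static unsigned int get_info_end(const struct cache_layout *s)
{
	if (freeptr_outside_object(s))
		return s->inuse + sizeof(void *);
	return s->inuse;
}
```

The point of the fix is that a non-zero offset no longer implies "free pointer after the object": with the pointer relocated to the middle, offset is non-zero but smaller than inuse, so only the equality check distinguishes the two layouts.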
|
#
89b83f28 |
|
20-Apr-2020 |
Kees Cook <keescook@chromium.org> |
slub: avoid redzone when choosing freepointer location Marco Elver reported system crashes when booting with "slub_debug=Z". The freepointer location (s->offset) was not taking into account that the "inuse" size that includes the redzone area should not be used by the freelist pointer. Change the calculation to save the area of the object that an inline freepointer may be written into. Fixes: 3202fa62fb43 ("slub: relocate freelist pointer to middle of object") Reported-by: Marco Elver <elver@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Marco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/202004151054.BD695840@keescook Link: https://lore.kernel.org/linux-mm/20200415164726.GA234932@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
81aba9e0 |
|
06-Apr-2020 |
Jules Irenge <jbi.octave@gmail.com> |
mm/slub: add missing annotation for put_map() Sparse reports a warning at put_map(): warning: context imbalance in put_map() - unexpected unlock. The root cause is the missing annotation at put_map(). Add the missing __releases(&object_map_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200214204741.94112-10-jbi.octave@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
31364c2e |
|
06-Apr-2020 |
Jules Irenge <jbi.octave@gmail.com> |
mm/slub: add missing annotation for get_map() Sparse reports a warning at get_map(): warning: context imbalance in get_map() - wrong count at exit. The root cause is the missing annotation at get_map(). Add the missing __acquires(&object_map_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200214204741.94112-9-jbi.octave@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3202fa62 |
|
01-Apr-2020 |
Kees Cook <keescook@chromium.org> |
slub: relocate freelist pointer to middle of object In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it became clear that moving the freelist pointer away from the edge of allocations would likely improve the overall defensive posture of the inline freelist pointer. My benchmarks show no meaningful change to performance (they seem to show it being faster), so this looks like a reasonable change to make. Instead of having the freelist pointer at the very beginning of an allocation (offset 0) or at the very end of an allocation (effectively offset -sizeof(void *) from the next allocation), move it away from the edges of the allocation and into the middle. This provides some protection against small-sized neighboring overflows (or underflows), for which the freelist pointer is commonly the target. (Large or well controlled overwrites are much more likely to attack live object contents, instead of attempting freelist corruption.) The vaunted kernel build benchmark, across 5 runs. Before: Mean: 250.05 Std Dev: 1.85 and after, which appears mysteriously faster: Mean: 247.13 Std Dev: 0.76 Attempts at running "sysbench --test=memory" show the change to be well in the noise (sysbench seems to be pretty unstable here -- it's not really measuring allocation). 
Hackbench is more allocation-heavy, and while the std dev is above the difference, it looks like it may manifest as an improvement as well: 20 runs of "hackbench -g 20 -l 1000", before: Mean: 36.322 Std Dev: 0.577 and after: Mean: 36.056 Std Dev: 0.598 [1] https://twitter.com/vnik5287/status/1235113523098685440 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
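A minimal sketch of how a "middle of the object" offset could be computed. The message does not spell out the formula, so this assumes the offset is half the object size rounded up to pointer alignment; middle_freeptr_offset() and align_up() are hypothetical names, and object sizes are assumed to be multiples of the pointer size.

```c
#include <stddef.h>

/* Round x up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Assumed placement rule: keep the free pointer at offset 0 when the
 * object only has room for one pointer, otherwise put it near the
 * middle, aligned to the pointer size.
 */
static size_t middle_freeptr_offset(size_t object_size, size_t ptr_size)
{
	if (object_size <= ptr_size)
		return 0;
	return align_up(object_size / 2, ptr_size);
}
```

Under this assumption a 64-byte object with 8-byte pointers stores its free pointer at offset 32, so a small overflow from the previous object or an underflow from the next one has to cross roughly half an object before reaching it.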
|
#
1ad53d9f |
|
01-Apr-2020 |
Kees Cook <keescook@chromium.org> |
slub: improve bit diffusion for freelist ptr obfuscation Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak in that the ptr and ptr address were usually so close that the first XOR would result in an almost entirely 0-byte value[1], leaving most of the "secret" number ultimately being stored after the third XOR. A single blind memory content exposure of the freelist was generally sufficient to learn the secret. Add a swab() call to mix bits a little more. This is a cheap way (1 cycle) to make attacks need more than a single exposure to learn the secret (or to know _where_ the exposure is in memory). kmalloc-32 freelist walk, before: ptr ptr_addr stored value secret ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d) ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d) ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d) ... 
after: ptr ptr_addr stored value secret ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d) ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d) ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d) ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d) ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d) [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Reported-by: Silvio Cesare <silvio.cesare@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
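The effect of the extra swab() can be reproduced in user space with the values quoted in the two freelist walks above. This is a sketch, not the kernel code: swab64(), obfuscate_old() and obfuscate_new() are hypothetical names, and 64-bit pointers are assumed.

```c
#include <stdint.h>

/* Byte-swap a 64-bit value; portable stand-in for the kernel's swab(). */
static uint64_t swab64(uint64_t x)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);	/* low byte moves to the top */
		x >>= 8;
	}
	return r;
}

/* Before: ptr and ptr_addr share their high bits and mostly cancel out. */
static uint64_t obfuscate_old(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
{
	return ptr ^ secret ^ ptr_addr;
}

/* After: the byte-swapped address no longer cancels against ptr. */
static uint64_t obfuscate_new(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
{
	return ptr ^ secret ^ swab64(ptr_addr);
}
```

Because XOR is its own inverse, applying the same function to the stored value with the same address and secret recovers the original pointer. In the "before" walk the stored value differs from the secret only in its low byte, which is why a single leak was enough to learn it.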
|
#
bbd4e305 |
|
01-Apr-2020 |
chenqiwu <chenqiwu@xiaomi.com> |
mm/slub.c: replace kmem_cache->cpu_partial with wrapped APIs There are slub_cpu_partial() and slub_set_cpu_partial() APIs to wrap kmem_cache->cpu_partial. This patch will use the two APIs to replace kmem_cache->cpu_partial in slub code. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/1582079562-17980-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4c7ba22e |
|
01-Apr-2020 |
chenqiwu <chenqiwu@xiaomi.com> |
mm/slub.c: replace cpu_slab->partial with wrapped APIs There are slub_percpu_partial() and slub_set_percpu_partial() APIs to wrap cpu_slab->partial. This patch will use the two to replace cpu_slab->partial in slub code. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/1581951895-3038-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fd7cb575 |
|
23-Mar-2020 |
Daniel Vetter <daniel.vetter@ffwll.ch> |
mm/sl[uo]b: export __kmalloc_track(_node)_caller slab does this already, and I want to use this in a memory allocation tracker in drm for stuff that's tied to the lifetime of a drm_device, not the underlying struct device. Kinda like devres, but for drm. Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-2-daniel.vetter@ffwll.ch
|
#
0715e6c5 |
|
21-Mar-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: prevent kmalloc_node crashes and memory leaks Sachin reports [1] a crash in SLUB __slab_alloc(): BUG: Kernel NULL pointer dereference on read at 0x000073b0 Faulting instruction address: 0xc0000000003d55f4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1 NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000 REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000 CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500 GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620 GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000 GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000 GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002 GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122 GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8 GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180 NIP ___slab_alloc+0x1f4/0x760 LR __slab_alloc+0x34/0x60 Call Trace: ___slab_alloc+0x334/0x760 (unreliable) __slab_alloc+0x34/0x60 __kmalloc_node+0x110/0x490 kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x108/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2ec/0x4d0 cgroup_mkdir+0x228/0x5f0 kernfs_iop_mkdir+0x90/0xf0 vfs_mkdir+0x110/0x230 do_mkdirat+0xb0/0x1a0 system_call+0x5c/0x68 This is a PowerPC platform with following NUMA topology: available: 2 nodes (0-1) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 1 size: 35247 MB node 1 free: 30907 MB node distances: node 0 1 0: 
10 40 1: 40 10 possible numa nodes: 0-31 This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node" [2] which effectively calls kmalloc_node for each possible node. SLUB however only allocates kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node for other nodes since commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node"). This is however not true in this configuration where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up accessing non-allocated kmem_cache_node. A related issue was reported by Bharata (originally by Ramachandran) [3] where a similar PowerPC configuration, but with mainline kernel without patch [2] ends up allocating large amounts of pages by kmalloc-1k and kmalloc-512. This seems to have the same underlying issue with node_to_mem_node() not behaving as expected, and might also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4]. This patch should fix both issues by not relying on node_to_mem_node() anymore and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is attempted for a node that's not online, or has no usable memory. The "usable memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly the condition that SLUB uses to allocate kmem_cache_node structures. The check in get_partial() is removed completely, as the checks in ___slab_alloc() are now sufficient to prevent get_partial() being reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/ [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/ Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christopher Lameter <cl@linux.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
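The fallback rule described above can be modelled in user space. This is a sketch only: the node-state bitmask, node_has_normal_memory() and effective_node() are hypothetical stand-ins for the kernel's node state checks, and MAX_NODES is an arbitrary limit chosen to match the 32 possible nodes in the report.

```c
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NODES	32

/* Hypothetical model: bit n set means node n is in N_NORMAL_MEMORY state. */
static bool node_has_normal_memory(unsigned long normal_mask, int node)
{
	return node >= 0 && node < MAX_NODES &&
	       ((normal_mask >> node) & 1UL);
}

/*
 * A request for a specific node that has no normal memory is treated
 * as "no preference", so a per-node structure that was never allocated
 * is never looked up.
 */
static int effective_node(unsigned long normal_mask, int node)
{
	if (node != NUMA_NO_NODE && !node_has_normal_memory(normal_mask, node))
		return NUMA_NO_NODE;
	return node;
}
```

For the reported topology only node 1 has memory, so the mask would be 1UL << 1: a request for node 0 or any of nodes 2-31 degrades to NUMA_NO_NODE, while a request for node 1 is honoured.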
|
#
5076190d |
|
17-Mar-2020 |
Linus Torvalds <torvalds@linux-foundation.org> |
mm: slub: be more careful about the double cmpxchg of freelist This is just a cleanup addition to Jann's fix to properly update the transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub: add missing TID bump.."). The transaction ID is what protects us against any concurrent accesses, but we should really also make sure that the 'freelist' comparison itself always uses the same freelist value that we then used as the new next free pointer. Jann points out that if we do all of this carefully, we could skip the transaction ID update for all the paths that only remove entries from the lists, and only update the TID when adding entries (to avoid the ABA issue with cmpxchg and list handling re-adding a previously seen value). But this patch just does the "make sure to cmpxchg the same value we used" rather than trying to be clever. Acked-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
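Why the freelist pointer and the transaction ID are compared together can be illustrated with a small userspace model. This is a sketch only: `cpu_slab_model`, `model_cmpxchg_double()` and `aba_is_detected()` are hypothetical names, the compare-and-exchange is deliberately non-atomic, and bumping the tid by 1 is an assumption made for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the per-CPU allocator state. */
struct cpu_slab_model {
    void *freelist;      /* first free object */
    unsigned long tid;   /* transaction ID, bumped on every update */
};

/*
 * Non-atomic stand-in for a double-word cmpxchg: the update is applied only
 * if BOTH the freelist and the tid still match the values read earlier.
 */
bool model_cmpxchg_double(struct cpu_slab_model *c,
                          void *old_freelist, unsigned long old_tid,
                          void *new_freelist, unsigned long new_tid)
{
    assert(c != NULL);
    if (c->freelist != old_freelist || c->tid != old_tid)
        return false;
    c->freelist = new_freelist;
    c->tid = new_tid;
    return true;
}

/* Returns true if an ABA-style interleaving is rejected by the tid check. */
bool aba_is_detected(void)
{
    int a, b;  /* two dummy objects, used only for their addresses */
    struct cpu_slab_model c = { .freelist = &a, .tid = 0 };

    /* Snapshot the state once and derive everything from that snapshot. */
    void *seen_freelist = c.freelist;
    unsigned long seen_tid = c.tid;

    /* Interference: object a is taken and later put back. */
    c.freelist = &b; c.tid++;
    c.freelist = &a; c.tid++;

    /* freelist matches the snapshot again, but the tid does not. */
    return !model_cmpxchg_double(&c, seen_freelist, seen_tid, &b, seen_tid + 1);
}
```

The point of the model is that the value compared and the value used to compute the new head both come from the same snapshot (`seen_freelist`), which is the property the cleanup above enforces.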
|
#
fd4d9c7d |
|
16-Mar-2020 |
Jann Horn <jannh@google.com> |
mm: slub: add missing TID bump in kmem_cache_alloc_bulk() When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu freelist of length M, and N > M > 0, it will first remove the M elements from the percpu freelist, then call ___slab_alloc() to allocate the next element and repopulate the percpu freelist. ___slab_alloc() can re-enable IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc() to properly commit the freelist head change. Fix it by unconditionally bumping c->tid when entering the slowpath. Cc: stable@vger.kernel.org Fixes: ebe909e0fdb3 ("slub: improve bulk alloc strategy") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
90e9f6a6 |
|
30-Jan-2020 |
Yu Zhao <yuzhao@google.com> |
mm/slub.c: avoid slub allocation while holding list_lock If we are already under list_lock, don't call kmalloc(). Otherwise we will run into a deadlock because kmalloc() also tries to grab the same lock. Fix the problem by using a static bitmap instead. WARNING: possible recursive locking detected -------------------------------------------- mount-encrypted/4921 is trying to acquire lock: (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437 but task is already holding lock: (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&n->list_lock)->rlock); lock(&(&n->list_lock)->rlock); *** DEADLOCK *** Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
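The idea of replacing a dynamically allocated bitmap with a static one can be sketched in userspace as below. This is an illustration under assumptions, not the kernel implementation: `MODEL_MAX_OBJS`, `object_map` and `count_in_use()` are hypothetical names, and the serialization a shared static buffer would need is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MODEL_MAX_OBJS 512  /* assumption: upper bound on objects per slab */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS      ((MODEL_MAX_OBJS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/*
 * Statically allocated, so filling it never calls back into the allocator.
 * A real shared buffer would need its own lock; that part is left out here.
 */
static unsigned long object_map[MAP_WORDS];

/*
 * Mark the free objects (given by index) in the static map, then count the
 * objects that are still in use.
 */
unsigned int count_in_use(const unsigned int *free_idx, unsigned int nr_free,
                          unsigned int nr_objs)
{
    unsigned int i, used = 0;

    assert(nr_objs <= MODEL_MAX_OBJS);
    memset(object_map, 0, sizeof(object_map));

    for (i = 0; i < nr_free; i++) {
        assert(free_idx[i] < nr_objs);
        object_map[free_idx[i] / BITS_PER_WORD] |=
            1UL << (free_idx[i] % BITS_PER_WORD);
    }

    for (i = 0; i < nr_objs; i++)
        if (!(object_map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))))
            used++;

    return used;
}
```

Because the map lives in static storage, the walk can run while a list lock is held without re-entering the allocator, which is the deadlock the commit message describes.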
|
#
cb923159 |
|
17-Jan-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
smp: Remove allocation mask from on_each_cpu_cond.*() The allocation mask is no longer used by on_each_cpu_cond() and on_each_cpu_cond_mask() and can be removed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de
|
#
8e57f8ac |
|
13-Jan-2020 |
Vlastimil Babka <vbabka@suse.cz> |
mm, debug_pagealloc: don't rely on static keys too early Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging") introduced a static key to reduce overhead when debug_pagealloc is compiled in but not enabled. It relied on the assumption that jump_label_init() is called before parse_early_param() as in start_kernel(), so when the "debug_pagealloc=on" option is parsed, it is safe to enable the static key. However, it turns out multiple architectures call parse_early_param() earlier from their setup_arch(). x86 also calls jump_label_init() even earlier, so no issue was found while testing the commit, but the same is not true for e.g. ppc64 and s390, where the kernel would not boot with debug_pagealloc=on, as found by our QA. To fix this without tricky changes to the init code of multiple architectures, this patch partially reverts the static key conversion from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch code) of debug_pagealloc_enabled() will again test a simple bool variable. Fastpath mm code is converted to a new debug_pagealloc_enabled_static() variant that relies on the static key, which is enabled at a well-defined point in mm_init() where it's guaranteed that jump_label_init() has been called, regardless of architecture. [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early] Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. 
Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
923717cb |
|
15-Oct-2019 |
Thomas Gleixner <tglx@linutronix.de> |
sched/rt, mm: Use CONFIG_PREEMPTION CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the pte_unmap_same() and SLUB code over to use CONFIG_PREEMPTION. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Chistoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20191015191821.11479-26-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
dd98afd4 |
|
30-Nov-2019 |
Yu Zhao <yuzhao@google.com> |
mm/slub.c: clean up validate_slab() The function doesn't need to return any value, and the check can be done in one pass. There is a behavior change: before the patch, we stop at the first invalid free object; after the patch, we stop at the first invalid object, free or in use. This shouldn't matter because the original behavior isn't intended anyway. Link: http://lkml.kernel.org/r/20191108193958.205102-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aed68148 |
|
30-Nov-2019 |
Yu Zhao <yuzhao@google.com> |
mm/slub.c: update comments Slub doesn't use PG_active and PG_error anymore. Link: http://lkml.kernel.org/r/20191007222023.162256-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e1b70dd1 |
|
30-Nov-2019 |
Miles Chen <miles.chen@mediatek.com> |
mm: slub: print the offset of fault addresses With commit ad67b74d2469 ("printk: hash addresses printed with %p"), it is a little bit harder to match the fault addresses printed by check_bytes_and_report() or slab_pad_check() in the dump because the fault addresses may not show up in the dump. Print the offset of the fault addresses to make it easier to match the incorrect poison or padding values in the dump. Before: We have to search the "63" in the dump. If we want to get the offset of 63, we have to count it from the start of Object dump. ============================================================= BUG kmalloc-128 (Not tainted): Poison overwritten ------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: 0x00000000570da294-0x00000000570da294. First byte 0x63 instead of 0x6b ... INFO: Object 0x000000006ebb3b9e @offset=14208 fp=0x0000000065862488 Redzone 00000000a6abccff: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 00000000741c16f0: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 0000000061ad278f: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 000000000467c1bd: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 000000008812766b: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 000000003d9b8f25: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 0000000000d80c33: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 00000000867b0d90: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Object 000000006ebb3b9e: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000005ea59a9f: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000003ef8bddc: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000008190375d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000006df7fb32: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 0000000069474eae: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 0000000008073b7d: 6b 6b 6b 6b 63 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b 6b Object 00000000b45ae74d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 After: We know the fault address is at @offset=1508, and the Object is at @offset=1408, so we know the fault address is at offset=100 within the object. ========================================================= BUG kmalloc-128 (Not tainted): Poison overwritten --------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: 0x00000000638ec1d1-0x00000000638ec1d1 @offset=1508. First byte 0x63 instead of 0x6b ... INFO: Object 0x000000008171818d @offset=1408 fp=0x0000000066dae230 Redzone 00000000e2697ab6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 0000000064b6a381: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 00000000e413a234: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 0000000004c1dfeb: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 000000009ad24d42: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 000000002a196a23: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 00000000a7b8468a: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Redzone 0000000088db6da3: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb Object 000000008171818d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000007c4035d4: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000004dd281a4: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 0000000079121dff: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 00000000756682a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 0000000053b7e541: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 0000000091f8d530: 6b 6b 6b 6b 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Object 000000009c76035c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 Link: http://lkml.kernel.org/r/20190925140807.20490-1-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph 
Lameter <cl@linux.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
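The arithmetic in the "After" example can be written out as a small sketch. The helper names are hypothetical, and the 16-bytes-per-row layout is taken from the hex dump shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the faulting byte relative to the start of its object. */
size_t fault_offset_in_object(size_t fault_offset, size_t object_offset)
{
    assert(fault_offset >= object_offset);
    return fault_offset - object_offset;
}

/* Zero-based row and column of a byte in a dump with 16 bytes per row. */
size_t dump_row(size_t offset_in_object)
{
    return offset_in_object / 16;
}

size_t dump_col(size_t offset_in_object)
{
    return offset_in_object % 16;
}
```

With the values from the log, 1508 - 1408 = 100, which is row 6, column 4 counting from zero: the seventh "Object" line and its fifth byte, where the dump shows 63 instead of 6b.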
|
#
aea4df4c |
|
15-Nov-2019 |
Laura Abbott <labbott@redhat.com> |
mm: slub: really fix slab walking for init_on_free Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") fixed one problem with the slab walking but missed a key detail: When walking the list, the head and tail pointers need to be updated since we end up reversing the list as a result. Without doing this, bulk free is broken. One way this is exposed is a NULL pointer with slub_debug=F: ============================================================================= BUG skbuff_head_cache (Tainted: G T): Object already free ----------------------------------------------------------------------------- INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201 BUG: kernel NULL pointer dereference, address: 0000000000000000 Oops: 0000 [#1] PREEMPT SMP PTI RIP: 0010:print_trailer+0x70/0x1d5 Call Trace: <IRQ> free_debug_processing.cold.37+0xc9/0x149 __slab_free+0x22a/0x3d0 kmem_cache_free_bulk+0x415/0x420 __kfree_skb_flush+0x30/0x40 net_rx_action+0x2dd/0x480 __do_softirq+0xf0/0x246 irq_exit+0x93/0xb0 do_IRQ+0xa0/0x110 common_interrupt+0xf/0xf </IRQ> Given we're now almost identical to the existing debugging code which correctly walks the list, combine with that. Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com Fixes: 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") Signed-off-by: Laura Abbott <labbott@redhat.com> Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Alexander Potapenko <glider@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: "David S. 
Miller" <davem@davemloft.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <clipos@ssi.gouv.fr> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
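The head/tail issue can be reproduced with a plain singly linked list. The sketch below is a userspace model with hypothetical names (`struct obj`, `wipe_and_rebuild()`), not the kernel function: it wipes each object while walking the chain and relinks it at the front, so the chain ends up reversed and both end pointers have to be updated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical free object: a link followed by some payload. */
struct obj {
    struct obj *next;
    char payload[16];
};

/*
 * Walk the chain from *head to *tail, zero every object, and push each one
 * onto the front of a new chain. The result is reversed, so *head and *tail
 * are rewritten to describe the new chain.
 */
void wipe_and_rebuild(struct obj **head, struct obj **tail)
{
    struct obj *next, *old_tail, *cur;

    assert(head && tail && *head && *tail);
    next = *head;
    old_tail = *tail;
    *head = NULL;
    *tail = NULL;

    do {
        cur = next;
        next = cur->next;              /* read the link before wiping it */
        memset(cur, 0, sizeof(*cur));  /* init_on_free-style wipe */
        cur->next = *head;             /* relink in front of the new chain */
        *head = cur;
        if (!*tail)
            *tail = cur;               /* first object processed is the new tail */
    } while (cur != old_tail);
}
```

If the caller kept using the old head and tail after such a walk, it would hand a chain with swapped ends to the bulk free path, which matches the failure mode described above.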
|
#
0f181f9f |
|
14-Oct-2019 |
Alexander Potapenko <glider@google.com> |
mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations slab_alloc_node() already zeroed out the freelist pointer if init_on_free was on. Thibaut Sautereau noticed that the same needs to be done for kmem_cache_alloc_bulk(), which performs the allocations separately. kmem_cache_alloc_bulk() is currently used in two places in the kernel, so this change is unlikely to have a major performance impact. SLAB doesn't require a similar change, as auto-initialization makes the allocator store the freelist pointers off-slab. Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Thibaut Sautereau <thibaut@sautereau.fr> Reported-by: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e4f8e513 |
|
14-Oct-2019 |
Qian Cai <cai@lca.pw> |
mm/slub: fix a deadlock in show_slab_objects() A long time ago we fixed a similar deadlock in show_slab_objects() [1]. However, apparently due to commits like 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") and 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back: just reading files in /sys/kernel/slab generates the lockdep splat below. Since the "mem_hotplug_lock" here is only taken to obtain a stable online node mask while racing with NUMA node hotplug, in the worst case the results may be miscalculated during NUMA node hotplug, but they will be corrected by later reads of the same files. WARNING: possible circular locking dependency detected ------------------------------------------------------ cat/5224 is trying to acquire lock: ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at: show_slab_objects+0x94/0x3a8 but task is already holding lock: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #2 (kn->count#45){++++}: lock_acquire+0x31c/0x360 __kernfs_remove+0x290/0x490 kernfs_remove+0x30/0x44 sysfs_remove_dir+0x70/0x88 kobject_del+0x50/0xb0 sysfs_slab_unlink+0x2c/0x38 shutdown_cache+0xa0/0xf0 kmemcg_cache_shutdown_fn+0x1c/0x34 kmemcg_workfn+0x44/0x64 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #1 (slab_mutex){+.+.}: lock_acquire+0x31c/0x360 __mutex_lock_common+0x16c/0xf78 mutex_lock_nested+0x40/0x50 memcg_create_kmem_cache+0x38/0x16c memcg_kmem_cache_create_func+0x3c/0x70 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #0 (mem_hotplug_lock.rw_sem){++++}: validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc other info that might help us debug this: Chain exists of: mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(kn->count#45); lock(slab_mutex); lock(kn->count#45); lock(mem_hotplug_lock.rw_sem); *** DEADLOCK *** 3 locks held by cat/5224: #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8 #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0 #2: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 stack backtrace: Call trace: dump_backtrace+0x0/0x248 show_stack+0x20/0x2c dump_stack+0xd0/0x140 print_circular_bug+0x368/0x380 check_noncircular+0x248/0x250 validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 
total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc I think it is important to mention that this doesn't expose the show_slab_objects to use-after-free. There is only a single path that might really race here and that is the slab hotplug notifier callback __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path doesn't really destroy kmem_cache_node data structures. [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock] Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6a486c0a |
|
06-Oct-2019 |
Vlastimil Babka <vbabka@suse.cz> |
mm, sl[ou]b: improve memory accounting Patch series "guarantee natural alignment for kmalloc()", v2. This patch (of 2): SLOB currently doesn't account its pages at all, so in /proc/meminfo the Slab field shows zero. Modifying a counter on page allocation and freeing should be acceptable even for the small system scenarios SLOB is intended for. Since reclaimable caches are not separated in SLOB, account everything as unreclaimable. SLUB currently doesn't account kmalloc() and kmalloc_node() allocations larger than order-1 page, that are passed directly to the page allocator. As they also don't appear in /proc/slabinfo, it might look like a memory leak. For consistency, account them as well. (SLAB doesn't actually use page allocator directly, so no change there). Ideally SLOB and SLUB would be handled in separate patches, but due to the shared kmalloc_order() function and different kfree() implementations, it's easier to patch both at once to prevent inconsistencies. Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a50b854e |
|
23-Sep-2019 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: introduce page_size() Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
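The replacement this commit describes amounts to wrapping one shift in a helper. A userspace sketch, assuming a 4 KiB base page and using hypothetical names (`struct page_model`, `model_page_size()`) in place of the kernel's types:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB base pages */

/* Hypothetical stand-in for a (possibly compound) page. */
struct page_model {
    unsigned int order;  /* compound order: the page spans 2^order base pages */
};

/* Size in bytes, i.e. the model's equivalent of PAGE_SIZE << compound_order(page). */
unsigned long model_page_size(const struct page_model *page)
{
    assert(page != NULL);
    return MODEL_PAGE_SIZE << page->order;
}
```

Under these assumptions an order-0 page is 4096 bytes and an order-9 page is 2097152 bytes (2 MiB).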
|
#
9d5f0be0 |
|
23-Sep-2019 |
Qian Cai <cai@lca.pw> |
mm/slub.c: fix -Wunused-function compiler warnings tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure() when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n, which is the default, Clang complains about those unused functions. Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
04f768a3 |
|
23-Sep-2019 |
Waiman Long <longman@redhat.com> |
mm, slab: extend slab/shrink to shrink all memcg caches Currently, a value of '1' is written to the /sys/kernel/slab/<slab>/shrink file to shrink the slab by flushing out all the per-cpu slabs and free slabs in partial lists. This can be useful to squeeze out a bit more memory under extreme conditions as well as to make the active object counts in /proc/slabinfo more accurate. This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if memcg sysfs is turned on, it is too cumbersome and impractical to manage all those per-memcg sysfs files in a real production system. So there is no practical way to shrink memcg caches. Fix this by enabling a proper write to the shrink sysfs file of the root cache to scan all the available memcg caches and shrink them as well. For a non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that cache will be shrunk when written. On a 2-socket 64-core 256-thread arm64 system with 64k pages, after a parallel kernel build, the amount of memory occupied by slabs before shrinking was: # grep task_struct /proc/slabinfo task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0 # grep "^S[lRU]" /proc/meminfo Slab: 3936832 kB SReclaimable: 399104 kB SUnreclaim: 3537728 kB After shrinking slabs (by echoing "1" to all shrink files): # grep "^S[lRU]" /proc/meminfo Slab: 1356288 kB SReclaimable: 263296 kB SUnreclaim: 1092992 kB # grep task_struct /proc/slabinfo task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0 Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> 
Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b7e816f |
|
31-Jul-2019 |
Laura Abbott <labbott@redhat.com> |
mm: slub: Fix slab walking for init_on_free To properly clear the slab on free with slab_want_init_on_free, we walk the list of free objects using get_freepointer/set_freepointer. The value we get from get_freepointer may not be valid. This isn't an issue since an actual value will get written later but this means there's a chance of triggering a bug if we use this value with set_freepointer: kernel BUG at mm/slub.c:306! invalid opcode: 0000 [#1] PREEMPT PTI CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4 RIP: 0010:kfree+0x58a/0x5c0 Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 <0f> 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4 RSP: 0000:ffffffff82603d90 EFLAGS: 00010002 RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000 RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320 RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100 R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300 FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0 Call Trace: apply_wqattrs_prepare+0x154/0x280 apply_workqueue_attrs_locked+0x4e/0xe0 apply_workqueue_attrs+0x36/0x60 alloc_workqueue+0x25a/0x6d0 workqueue_init_early+0x246/0x348 start_kernel+0x3c7/0x7ec x86_64_start_reservations+0x40/0x49 x86_64_start_kernel+0xda/0xe4 secondary_startup_64+0xb6/0xc0 Modules linked in: ---[ end trace f67eb9af4d8d492b ]--- Fix this by ensuring the value we set with set_freepointer is either NULL or another value in the chain. 
Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6471384a |
|
11-Jul-2019 |
Alexander Potapenko <glider@google.com> |
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options Patch series "add init_on_alloc/init_on_free boot options", v10. Provide init_on_alloc and init_on_free boot options. These are aimed at preventing possible information leaks and making control-flow bugs that depend on uninitialized values more deterministic. Enabling either of the options guarantees that the memory returned by the page allocator and SL[AU]B is initialized with zeroes. The SLOB allocator isn't supported at the moment, as its emulation of kmem caches complicates correct handling of SLAB_TYPESAFE_BY_RCU caches. Enabling init_on_free also guarantees that pages and heap objects are initialized right after they're freed, so it won't be possible to access stale data by using a dangling pointer. As suggested by Michal Hocko, right now we don't let heap users disable initialization for certain allocations. There's not enough evidence that doing so can speed up real-life cases, and introducing ways to opt out may result in things going out of control. This patch (of 2): The new options are needed to prevent possible information leaks and make control-flow bugs that depend on uninitialized values more deterministic. This is expected to be on-by-default on Android and Chrome OS. And it gives the opportunity for anyone else to use it under distros too via the boot args. (The init_on_free feature is regularly requested by folks where memory forensics is included in their threat models.) init_on_alloc=1 makes the kernel initialize newly allocated pages and heap objects with zeroes. Initialization is done at allocation time at the places where checks for __GFP_ZERO are performed. init_on_free=1 makes the kernel initialize freed pages and heap objects with zeroes upon their deletion. This helps to ensure sensitive data doesn't leak via use-after-free accesses. Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator returns zeroed memory. 
The two exceptions are slab caches with constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never zero-initialized to preserve their semantics. Both init_on_alloc and init_on_free default to zero, but those defaults can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON. If either SLUB poisoning or page poisoning is enabled, those options take precedence over init_on_alloc and init_on_free: initialization is only applied to unpoisoned allocations. Slowdown for the new features compared to init_on_free=0, init_on_alloc=0: hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%) hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%) Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%) Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%) Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%) Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%) The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline is within the standard error. The new features are also going to pave the way for hardware memory tagging (e.g. arm64's MTE), which will require both on_alloc and on_free hooks to set the tags for heap objects. With MTE, tagging will have the same cost as memory initialization. Although init_on_free is rather costly, there are paranoid use-cases where in-memory data lifetime is desired to be minimized. There are various arguments for/against the realism of the associated threat models, but given that we'll need the infrastructure for MTE anyway, and there are people who want wipe-on-free behavior no matter what the performance cost, it seems reasonable to include it in this series. 
[glider@google.com: v8] Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com [glider@google.com: v9] Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com [glider@google.com: v10] Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts Acked-by: James Morris <jamorris@linux.microsoft.com>] Cc: Christoph Lameter <cl@linux.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sandeep Patil <sspatil@android.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
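The zeroing rules described in this message can be modelled in a few lines of user-space C. This is only a sketch of the rules as stated above, not the kernel implementation: every name prefixed with sketch_ or SKETCH_ is hypothetical, the flag values are arbitrary, and treating an explicit zeroing request as always honoured is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for __GFP_ZERO, SLAB_TYPESAFE_BY_RCU and SLAB_POISON. */
#define SKETCH_GFP_ZERO         0x1u
#define SKETCH_TYPESAFE_BY_RCU  0x2u
#define SKETCH_POISON           0x4u

/* Minimal model of a slab cache: its flags and an optional constructor. */
struct sketch_cache {
	unsigned int flags;
	void (*ctor)(void *obj);
};

/* Placeholder constructor, used only to make a cache "have a ctor". */
static void sketch_noop_ctor(void *obj)
{
	(void)obj;
}

/*
 * Caches with constructors, SLAB_TYPESAFE_BY_RCU caches and poisoned
 * caches are exempt from the boot-option initialization.
 */
static bool sketch_cache_is_exempt(const struct sketch_cache *c)
{
	assert(c != NULL);
	return c->ctor != NULL ||
	       (c->flags & (SKETCH_TYPESAFE_BY_RCU | SKETCH_POISON)) != 0;
}

/* Zero on allocation if the caller asked for it, or if the option applies. */
static bool sketch_init_on_alloc(bool boot_opt, const struct sketch_cache *c,
				 unsigned int gfp)
{
	if (gfp & SKETCH_GFP_ZERO)
		return true;
	return boot_opt && !sketch_cache_is_exempt(c);
}

/* Zero on free only when the option is set and the cache is not exempt. */
static bool sketch_init_on_free(bool boot_opt, const struct sketch_cache *c)
{
	return boot_opt && !sketch_cache_is_exempt(c);
}
```

In this model an ordinary cache is zeroed whenever the boot option is on, while a SLAB_TYPESAFE_BY_RCU cache or a cache with a constructor is left alone even with init_on_alloc=1 or init_on_free=1.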
|
#
6cea1d56 |
|
11-Jul-2019 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: unify SLAB and SLUB page accounting Currently the page accounting code is duplicated in SLAB and SLUB internals. Let's move it into new (un)charge_slab_page helpers in the slab_common.c file. These helpers will be responsible for statistics (global and memcg-aware) and memcg charging. So they are replacing direct memcg_(un)charge_slab() calls. Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
43486694 |
|
11-Jul-2019 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: generalize postponed non-root kmem_cache deactivation Currently SLUB uses a work scheduled after an RCU grace period to deactivate a non-root kmem_cache. This mechanism can be reused for kmem_caches release, but requires generalization for SLAB case. Introduce kmemcg_cache_deactivate() function, which calls allocator-specific __kmemcg_cache_deactivate() and schedules execution of __kmemcg_cache_deactivate_after_rcu() with all necessary locks in a worker context after an rcu grace period. Here is the new calling scheme:
  kmemcg_cache_deactivate()
    __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
    kmemcg_rcufn()                               rcu
      kmemcg_workfn()                            work
        __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
instead of:
  __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
    slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
      kmemcg_rcufn()                             rcu
        kmemcg_workfn()                          work
          kmemcg_cache_deact_after_rcu()         SLUB-only
For consistency, all allocator-specific functions start with "__". Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c03914b7 |
|
11-Jul-2019 |
Roman Gushchin <guro@fb.com> |
mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Patch series "mm: reparent slab memory on cgroup removal", v7. # Why do we need this? We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects. The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed. Slab objects, and first of all the vfs cache, are shared between cgroups which are using the same underlying fs, and what's even more important, they are shared between multiple generations of the same workload. So if something is running periodically every time in a new cgroup (like how systemd works), we do accumulate multiple dying cgroups. Strictly speaking pagecache isn't different here, but there is a key difference: we disable protection and apply some extra pressure on LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with the disabled kernel memory accounting the number of dying cgroups stabilizes at a relatively small number (~100, depends on memory pressure and cgroup creation rate), and with kernel memory accounting it grows pretty steadily up to several thousands. Memory cgroups are quite complex and big objects (mostly due to percpu stats), so it leads to noticeable memory losses. Memory occupied by dying cgroups is measured in hundreds of megabytes. I've even seen a host with more than 100Gb of memory wasted for dying cgroups. It leads to a degradation of performance with the uptime, and generally limits the usage of cgroups. 
My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful. So instead of trying to find a possibly non-existent balance, let's reparent accounted slab caches to the parent cgroup on cgroup removal. # Implementation approach There is however a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are in shrinker lists, but not all. Introducing a new list is really not an option. But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages. It's actually simpler and cheaper, but requires some underlying changes: 1) Make kmem_caches hold a single reference to the memory cgroup, instead of a separate reference per every slab page. 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use page->kmem_cache->memcg indirection instead. It's used only on slab page release, so performance overhead shouldn't be a big issue. 3) Introduce a refcounter for non-root slab caches. It's required to be able to destroy kmem_caches when they become empty and release the associated memory cgroup. There is a bonus: currently we release all memcg kmem_caches all together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free. Some additional implementation details are provided in corresponding commit messages. # Results Below is the average number of dying cgroups on two groups of our production hosts. They do run some sort of web frontend workload, the memory pressure is moderate. As we can see, with the kernel memory reparenting the number stabilizes in 60s range; however with the original version it grows almost linearly and doesn't show any signs of plateauing. 
The difference in slab and percpu usage between patched and unpatched versions also grows linearly. In 7 days it exceeded 200Mb.
  day              0    1    2    3     4     5     6     7
  original        56  362  628  752  1070  1250  1490  1560
  patched         23   46   51   55    60    57    67    69
  mem diff(Mb)    22   74  123  152   164   182   214   241
# Links
  [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
  [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
  [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
  [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
  [5]: https://lkml.org/lkml/2019/1/28/1865
  [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
This patch (of 10): Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params(). Once kmem_cache will hold a reference to the memory cgroup, it will simplify the refcounting. For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe. Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
10d1f8cb |
|
11-Jul-2019 |
Marco Elver <elver@google.com> |
mm/slab: refactor common ksize KASAN logic into slab_common.c This refactors common code of ksize() between the various allocators into slab_common.c: __ksize() is the allocator-specific implementation without instrumentation, whereas ksize() includes the required KASAN logic. Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cb097cd4 |
|
11-Jul-2019 |
Shakeel Butt <shakeelb@google.com> |
slub: don't panic for memcg kmem cache creation failure Currently for CONFIG_SLUB, if creation of a memcg kmem cache fails and the corresponding root kmem cache has the SLAB_PANIC flag, the kernel panics. This is unnecessary, as the kernel can handle creation failures of memcg kmem caches. Additionally, CONFIG_SLAB does not implement this behavior. So, to keep the behavior consistent between SLAB and SLUB, remove the panic for memcg kmem cache creation failures. The root kmem cache creation failure for SLAB_PANIC correctly panics for both SLAB and SLUB. Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Roman Gushchin <guro@fb.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9cf3a8d8 |
|
11-Jul-2019 |
Yury Norov <yury.norov@gmail.com> |
mm/slub.c: avoid double string traverse in kmem_cache_flags() If ',' is not found, kmem_cache_flags() calls strlen() to find the end of line. We can do it in a single pass using strchrnul(). Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com Signed-off-by: Yury Norov <ynorov@marvell.com> Acked-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
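The difference between the two approaches can be sketched in user-space C. This is an illustration, not the kmem_cache_flags() code: the helper names are hypothetical, and sketch_strchrnul() is a local stand-in with the same contract as strchrnul() (return a pointer to the first match, or to the terminating NUL if there is none).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for strchrnul(): first match of c, or the terminating NUL. */
static const char *sketch_strchrnul(const char *s, int c)
{
	assert(s != NULL);
	while (*s && *s != (char)c)
		s++;
	return s;
}

/* Old pattern: strchr(), then a second walk with strlen() when ',' is absent. */
static size_t sketch_token_len_two_pass(const char *s)
{
	const char *end = strchr(s, ',');

	if (!end)
		end = s + strlen(s);
	return (size_t)(end - s);
}

/* New pattern: a single walk that stops at ',' or at the end of the string. */
static size_t sketch_token_len_single_pass(const char *s)
{
	return (size_t)(sketch_strchrnul(s, ',') - s);
}
```

Both helpers return the same length for any input; the single-pass version simply avoids walking the string twice when no ',' is present.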
|
#
632b2ef0 |
|
13-May-2019 |
Liu Xiang <liu.xiang6@zte.com.cn> |
mm/slub.c: update the comment about slab frozen Now frozen slab can only be on the per cpu partial list. Link: http://lkml.kernel.org/r/1554022325-11305-1-git-send-email-liu.xiang6@zte.com.cn Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a4d3f891 |
|
13-May-2019 |
Liu Xiang <liu.xiang6@zte.com.cn> |
slub: remove useless kmem_cache_debug() before remove_full() When CONFIG_SLUB_DEBUG is not enabled, remove_full() is empty. While CONFIG_SLUB_DEBUG is enabled, remove_full() can check s->flags by itself. So kmem_cache_debug() is useless and can be removed. Link: http://lkml.kernel.org/r/1552577313-2830-1-git-send-email-liu.xiang6@zte.com.cn Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
916ac052 |
|
13-May-2019 |
Tobin C. Harding <tobin@kernel.org> |
slub: use slab_list instead of lru Currently we use the page->lru list for maintaining lists of slabs. We have a list in the page structure (slab_list) that can be used for this purpose. Doing so makes the code cleaner since we are not overloading the lru list. Use the slab_list instead of the lru list for maintaining lists of slabs. Link: http://lkml.kernel.org/r/20190402230545.2929-6-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6dfd1b65 |
|
13-May-2019 |
Tobin C. Harding <tobin@kernel.org> |
slub: add comments to endif pre-processor macros SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The pairing of these statements is at times hard to follow e.g. if the pair are further than a screen apart or if there are nested pairs. We can reduce cognitive load by adding a comment to the endif statement of form #ifdef CONFIG_FOO ... #endif /* CONFIG_FOO */ Add comments to endif pre-processor macros if ifdef/endif pair is not immediately apparent. Link: http://lkml.kernel.org/r/20190402230545.2929-5-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
79716799 |
|
25-Apr-2019 |
Thomas Gleixner <tglx@linutronix.de> |
mm/slub: Simplify stack trace retrieval Replace the indirection through struct stack_trace with an invocation of the storage array based interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094801.771410441@linutronix.de
|
#
b8ca7ff7 |
|
09-Apr-2019 |
Thomas Gleixner <tglx@linutronix.de> |
mm/slub: Remove the ULONG_MAX stack trace hackery No architecture terminates the stack trace with ULONG_MAX anymore. Remove the cruft. While at it remove the pointless loop of clearing the stack array completely. It's sufficient to clear the last entry as the consumers break out on the first zeroed entry anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Link: https://lkml.kernel.org/r/20190410103644.574058244@linutronix.de
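Why a single terminating zero is enough can be shown with a small user-space model. It is a sketch under assumptions: the struct, the array size of 16 and all sketch_ names are hypothetical and do not reproduce the kernel's tracking structures.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TRACK_ADDRS 16	/* hypothetical capacity of the trace array */

struct sketch_track {
	unsigned long addrs[SKETCH_TRACK_ADDRS];
};

/*
 * Store up to SKETCH_TRACK_ADDRS entries. Instead of clearing the whole
 * array first, only the slot right after the last stored entry is zeroed.
 */
static void sketch_save_trace(struct sketch_track *t,
			      const unsigned long *src, size_t nr)
{
	size_t i;

	assert(t != NULL && src != NULL);
	if (nr > SKETCH_TRACK_ADDRS)
		nr = SKETCH_TRACK_ADDRS;
	for (i = 0; i < nr; i++)
		t->addrs[i] = src[i];
	if (nr < SKETCH_TRACK_ADDRS)
		t->addrs[nr] = 0;
}

/* Consumer: stops at the first zeroed entry, so stale data after it is ignored. */
static size_t sketch_trace_len(const struct sketch_track *t)
{
	size_t i;

	for (i = 0; i < SKETCH_TRACK_ADDRS; i++)
		if (!t->addrs[i])
			break;
	return i;
}
```

Even if the array still holds entries from an earlier, longer trace, the consumer never reads past the terminating zero.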
|
#
6d6ea1e9 |
|
28-Mar-2019 |
Nicolas Boichat <drinkcat@chromium.org> |
mm: add support for kmem caches in DMA32 zone Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables", v6. This is a followup to the discussion in [1], [2]. IOMMUs using ARMv7 short-descriptor format require page tables (level 1 and 2) to be allocated within the first 4GB of RAM, even on 64-bit systems. For L1 tables that are bigger than a page, we can just use __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still use GFP_DMA). For L2 tables that only take 1KB, it would be a waste to allocate a full page, so we considered 3 approaches: 1. This series, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. [3] This series is the most memory-efficient approach. stable@ note: We confirmed that this is a regression, and IOMMU errors happen on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek platforms (and maybe others?). [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html [3] https://patchwork.codeaurora.org/patch/671639/ This patch (of 3): IOMMUs using ARMv7 short-descriptor format require page tables to be allocated within the first 4GB of RAM, even on 64-bit systems. On arm64, this is done by passing GFP_DMA32 flag to memory allocation functions. For IOMMU L2 tables that only take 1KB, it would be a waste to allocate a full page using get_free_pages, so we considered 3 approaches: 1. This patch, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. 
page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. This change makes it possible to create a custom cache in DMA32 zone using kmem_cache_create, then allocate memory using kmem_cache_alloc. We do not create a DMA32 kmalloc cache array, as there are currently no users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK. This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32 kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and unnecessary). Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9726c26 |
|
05-Mar-2019 |
Alexey Dobriyan <adobriyan@gmail.com> |
numa: make "nr_node_ids" unsigned int Number of NUMA nodes can't be negative. This saves a few bytes on x86_64:
  add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
  Function                         old     new   delta
  hv_synic_alloc.cold               88     110     +22
  prealloc_shrinker                260     262      +2
  bootstrap                        249     251      +2
  sched_init_numa                 1566    1567      +1
  show_slab_objects                778     777      -1
  s_show                          1201    1200      -1
  kmem_cache_init                  346     345      -1
  __alloc_workqueue_key           1146    1145      -1
  mem_cgroup_css_alloc            1614    1612      -2
  __do_sys_swapon                 4702    4699      -3
  __list_lru_init                  655     651      -4
  nic_probe                       2379    2374      -5
  store_user_store                 118     111      -7
  red_zone_store                   106      99      -7
  poison_store                     106      99      -7
  wq_numa_init                     348     338     -10
  __kmem_cache_empty                75      65     -10
  task_numa_free                   186     173     -13
  merge_across_nodes_store         351     336     -15
  irq_create_affinity_masks       1261    1246     -15
  do_numa_crng_init                343     321     -22
  task_numa_fault                 4760    4737     -23
  swapfile_init                    179     156     -23
  hv_synic_alloc                   536     492     -44
  apply_wqattrs_prepare            746     695     -51
Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8bb4e7a2 |
|
05-Mar-2019 |
Wei Yang <richard.weiyang@gmail.com> |
mm: fix some typos in mm directory No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9234bae9 |
|
05-Mar-2019 |
Wei Yang <richard.weiyang@gmail.com> |
mm, slub: make the comment of put_cpu_partial() complete There are two cases when put_cpu_partial() is invoked. * __slab_free * get_partial_node This patch just makes it cover these two cases. Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
278d7756 |
|
05-Mar-2019 |
Qian Cai <cai@lca.pw> |
mm/slub.c: remove an unused addr argument "addr" function argument is not used in alloc_consistency_checks() at all, so remove it. Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw Fixes: becfda68abca ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
edde82b6 |
|
05-Mar-2019 |
Peng Wang <rocking@whu.edu.cn> |
mm/slub.c: freelist is ensured to be NULL when new_slab() fails new_slab_objects() will return immediately if freelist is not NULL. if (freelist) return freelist; One more assignment operation could be avoided. Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn Signed-off-by: Peng Wang <rocking@whu.edu.cn> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6373dca1 |
|
20-Feb-2019 |
Qian Cai <cai@lca.pw> |
slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS In process_slab(), "p = get_freepointer()" could return a tagged pointer, but "addr = page_address()" always returns a native pointer. As a result, slab_index() computes a wrong index here, return (p - addr) / s->size; All other callers of slab_index() are in the same situation, where "addr" comes from page_address(), so it is enough to untag "p". # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls Unable to handle kernel paging request at virtual address 2bff808aa4856d48 Mem abort info: ESR = 0x96000007 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000007 CM = 0, WnR = 0 swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338 [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712 Internal error: Oops: 96000007 [#1] SMP CPU: 3 PID: 79210 Comm: read_all Tainted: G L 5.0.0-rc7+ #84 Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 pstate: 00400089 (nzcv daIf +PAN -UAO) pc : get_map+0x78/0xec lr : get_map+0xa0/0xec sp : aeff808989e3f8e0 x29: aeff808989e3f940 x28: ffff800826200000 x27: ffff100012d47000 x26: 9700000000002500 x25: 0000000000000001 x24: 52ff8008200131f8 x23: 52ff8008200130a0 x22: 52ff800820013098 x21: ffff800826200000 x20: ffff100013172ba0 x19: 2bff808a8971bc00 x18: ffff1000148f5538 x17: 000000000000001b x16: 00000000000000ff x15: ffff1000148f5000 x14: 00000000000000d2 x13: 0000000000000001 x12: 0000000000000000 x11: 0000000020000002 x10: 2bff808aa4856d48 x9 : 0000020000000000 x8 : 68ff80082620ebb0 x7 : 0000000000000000 x6 : ffff1000105da1dc x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000010 x2 : 2bff808a8971bc00 x1 : ffff7fe002098800 x0 : ffff80082620ceb0 Process read_all (pid: 79210, stack limit = 0x00000000f65b9361) Call trace: get_map+0x78/0xec process_slab+0x7c/0x47c list_locations+0xb0/0x3c8 alloc_calls_show+0x34/0x40 
slab_attr_show+0x34/0x48 sysfs_kf_seq_show+0x2e4/0x570 kernfs_seq_show+0x12c/0x1a0 seq_read+0x48c/0xf84 kernfs_fop_read+0xd4/0x448 __vfs_read+0x94/0x5d4 vfs_read+0xcc/0x194 ksys_read+0x6c/0xe8 __arm64_sys_read+0x68/0xb0 el0_svc_handler+0x230/0x3bc el0_svc+0x8/0xc Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b) ---[ end trace a383a9a44ff13176 ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 1-7,32,40,127 Kernel Offset: disabled CPU features: 0x002,20000c18 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
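The index arithmetic behind this crash can be reproduced with plain 64-bit integers. This is a simplified model, not the kernel code: it assumes the tag lives in the top byte and that a native kernel pointer carries 0xff there (as the addresses in the oops above suggest), and all sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TAG_SHIFT  56
#define SKETCH_TAG_MASK   (0xffULL << SKETCH_TAG_SHIFT)
#define SKETCH_TAG_NATIVE 0xffULL	/* assumed top byte of an untagged kernel pointer */

/* Replace the top byte of an address with the given tag. */
static uint64_t sketch_set_tag(uint64_t addr, uint64_t tag)
{
	return (addr & ~SKETCH_TAG_MASK) | (tag << SKETCH_TAG_SHIFT);
}

/* Model of untagging: restore the native top byte. */
static uint64_t sketch_reset_tag(uint64_t addr)
{
	return sketch_set_tag(addr, SKETCH_TAG_NATIVE);
}

/* Broken variant: subtracts a native base from a possibly tagged pointer. */
static uint64_t sketch_slab_index_naive(uint64_t p, uint64_t size, uint64_t base)
{
	assert(size != 0);
	return (p - base) / size;
}

/* Fixed variant: untag "p" before doing the pointer arithmetic. */
static uint64_t sketch_slab_index(uint64_t p, uint64_t size, uint64_t base)
{
	assert(size != 0);
	return (sketch_reset_tag(p) - base) / size;
}
```

With a native base address and a tagged object pointer, the naive subtraction wraps around and yields a huge index, while the untagged version returns the expected object number.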
|
#
338cfaad |
|
20-Feb-2019 |
Qian Cai <cai@lca.pw> |
slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS with KASAN_SW_TAGS triggers endless false positives during boot below due to check_valid_pointer() checks tagged pointers which have no addresses that is valid within slab pages: BUG radix_tree_node (Tainted: G B ): Freelist Pointer check fails ----------------------------------------------------------------------------- INFO: Slab objects=69 used=69 fp=0x (null) flags=0x7ffffffc000200 INFO: Object @offset=15060037153926966016 fp=0x Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0 .........k...... Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00 .k.............. Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Redzone: bb bb bb bb bb bb bb bb ........ 
Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.0.0-rc5+ #18 Call trace: dump_backtrace+0x0/0x450 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0xa0/0xfc print_trailer+0x1bc/0x1d0 object_err+0x40/0x50 alloc_debug_processing+0xf0/0x19c ___slab_alloc+0x554/0x704 kmem_cache_alloc+0x2f8/0x440 radix_tree_node_alloc+0x90/0x2fc idr_get_free+0x1e8/0x6d0 idr_alloc_u32+0x11c/0x2a4 idr_alloc+0x74/0xe0 worker_pool_assign_id+0x5c/0xbc workqueue_init_early+0x49c/0xd50 start_kernel+0x52c/0xac4 FIX radix_tree_node: Marking all objects used Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d36a63a9 |
|
20-Feb-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged. Normally, this doesn't cause any issues, as both set_freepointer() and get_freepointer() are called with a pointer with the same tag. However, there are some issues with CONFIG_SLUB_DEBUG code. For example, when __free_slab() iterates over objects in a cache, it passes untagged pointers to check_object(). check_object() in turn calls get_freepointer() with an untagged pointer, which causes the freepointer to be restored incorrectly. Add kasan_reset_tag() to freelist_ptr(). Also add a detailed comment. Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
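The failure mode described above can be reproduced in a small user-space model. This is only a sketch: the helper names (`set_tag`, `reset_tag`, `freelist_ptr_naive`, `freelist_ptr_fixed`) and the constants are assumptions made for illustration, not the kernel's definitions. It models a hardened freelist pointer as `ptr ^ secret ^ ptr_addr` and shows that storing with a tagged `ptr_addr` and loading with an untagged one only round-trips if the tag is reset before hashing.

```c
#include <stdint.h>

/* Illustrative constants, assumed for this model only. */
#define TAG_SHIFT  56
#define TAG_KERNEL 0xffULL  /* top byte of an untagged kernel pointer */

/* Replace the top byte of addr with the given tag. */
static uint64_t set_tag(uint64_t addr, uint8_t tag)
{
    return (addr & ~(0xffULL << TAG_SHIFT)) | ((uint64_t)tag << TAG_SHIFT);
}

/* Restore the native (untagged) top byte. */
static uint64_t reset_tag(uint64_t addr)
{
    return addr | (TAG_KERNEL << TAG_SHIFT);
}

/* Without the fix: the obfuscated value depends on the tag in ptr_addr. */
static uint64_t freelist_ptr_naive(uint64_t secret, uint64_t ptr,
                                   uint64_t ptr_addr)
{
    return ptr ^ secret ^ ptr_addr;
}

/* With the fix: the tag is stripped from ptr_addr before hashing. */
static uint64_t freelist_ptr_fixed(uint64_t secret, uint64_t ptr,
                                   uint64_t ptr_addr)
{
    return ptr ^ secret ^ reset_tag(ptr_addr);
}
```

Because XOR is its own inverse, calling the same function twice with the same `ptr_addr` recovers the original pointer. The naive variant breaks as soon as the two calls disagree on the tag, which is the situation the debug path runs into.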
|
#
18e50661 |
|
20-Feb-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address of the object where the pointer gets stored. With tag based KASAN we don't account for that when building freelist, as we call set_freepointer() with the first argument untagged. This patch changes the code to properly propagate tags throughout the loop. Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a7101224 |
|
20-Feb-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, slub: move kasan_poison_slab hook before page_address With tag based KASAN page_address() looks at the page flags to see whether the resulting pointer needs to have a tag set. Since we don't want to set a tag when page_address() is called on SLAB pages, we call page_kasan_tag_reset() in kasan_poison_slab(). However in allocate_slab() page_address() is called before kasan_poison_slab(). Fix it by changing the order. [andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n] Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a2f77575 |
|
20-Feb-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kmemleak: account for tagged pointers when calculating pointer range kmemleak keeps two global variables, min_addr and max_addr, which store the range of valid (encountered by kmemleak) pointer values, which it later uses to speed up pointer lookup when scanning blocks. With tagged pointers this range will get bigger than it needs to be. This patch makes kmemleak untag pointers before saving them to min_addr and max_addr and when performing a lookup. Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
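A minimal sketch of the idea, assuming a 64-bit address space with the tag in the top byte. The names `ptr_range`, `untag`, `range_record` and `range_may_contain` are hypothetical and do not come from kmemleak itself. Pointers are untagged both when the range is updated and when a candidate is checked against it, so a tag never widens the range or causes a false miss.

```c
#include <stdint.h>

/* Range of pointer values seen so far (models min_addr/max_addr). */
struct ptr_range {
    uint64_t min_addr;
    uint64_t max_addr;
};

/* Restore the native top byte (0xff) of a possibly tagged pointer. */
static uint64_t untag(uint64_t ptr)
{
    return ptr | (0xffULL << 56);
}

static void range_init(struct ptr_range *r)
{
    r->min_addr = UINT64_MAX;
    r->max_addr = 0;
}

/* Record an object; the tag is dropped before touching the bounds. */
static void range_record(struct ptr_range *r, uint64_t ptr, uint64_t size)
{
    uint64_t u = untag(ptr);

    if (u < r->min_addr)
        r->min_addr = u;
    if (u + size > r->max_addr)
        r->max_addr = u + size;
}

/* Quick pre-check used before a more expensive lookup. */
static int range_may_contain(const struct ptr_range *r, uint64_t ptr)
{
    uint64_t u = untag(ptr);

    return u >= r->min_addr && u < r->max_addr;
}
```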
|
#
53128245 |
|
20-Feb-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, kmemleak: pass tagged pointers to kmemleak Right now we call kmemleak hooks before assigning tags to pointers in KASAN hooks. As a result, when an object gets allocated, kmemleak sees a differently tagged pointer, compared to the one it sees when the object gets freed. Fix it by calling KASAN hooks before kmemleak's ones. Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
96fedce2 |
|
08-Jan-2019 |
Andrey Konovalov <andreyknvl@google.com> |
kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY With CONFIG_HARDENED_USERCOPY enabled __check_heap_object() compares and then subtracts a potentially tagged pointer with a non-tagged address of the page that this pointer belongs to, which leads to unexpected behavior. Untag the pointer in __check_heap_object() before doing any of these operations. Link: http://lkml.kernel.org/r/7e756a298d514c4482f52aea6151db34818d395d.1546540962.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
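The following user-space sketch illustrates why the pointer has to be untagged before it is compared with, or subtracted from, the untagged page address. The function names and the simplified layout (objects packed from the start of the page) are assumptions for illustration and are not the body of `__check_heap_object()`.

```c
#include <stddef.h>
#include <stdint.h>

/* Restore the native top byte (0xff) of a possibly tagged pointer. */
static uint64_t untag(uint64_t ptr)
{
    return ptr | (0xffULL << 56);
}

/*
 * Offset of ptr within its object, or -1 if ptr lies below the page.
 * Broken variant: a tagged ptr is compared with an untagged page_base.
 */
static int64_t object_offset_naive(uint64_t ptr, uint64_t page_base,
                                   size_t obj_size)
{
    if (ptr < page_base)
        return -1;
    return (int64_t)((ptr - page_base) % obj_size);
}

/* Fixed variant: strip the tag first, then compare and subtract. */
static int64_t object_offset(uint64_t ptr, uint64_t page_base,
                             size_t obj_size)
{
    return object_offset_naive(untag(ptr), page_base, obj_size);
}
```

With a tag such as 0x3c in the top byte, the tagged pointer compares as lower than the page address, so the naive variant rejects a perfectly valid object.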
|
#
88349a28 |
|
28-Dec-2018 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: record final state of slub action in deactivate_slab() If __cmpxchg_double_slab() fails and (l != m), the current code records transitional states of the slub action. Update the action only after __cmpxchg_double_slab() succeeds, so that the final state is recorded. [akpm@linux-foundation.org: more whitespace cleanup] Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6159d0f5 |
|
28-Dec-2018 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: page is always non-NULL in node_match() node_match() is a static function and is only invoked in slub.c. In all three places, `page' is ensured to be valid. Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1265ef2d |
|
28-Dec-2018 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab() cpu_slab is a per cpu variable which is allocated in all or none. If a cpu_slab failed to be allocated, the slub is not usable. We could use cpu_slab without validation in __flush_cpu_slab(). Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4d176711 |
|
28-Dec-2018 |
Andrey Konovalov <andreyknvl@google.com> |
kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU An object constructor can initialize pointers within this object based on the address of the object. Since the object address might be tagged, we need to assign a tag before calling the constructor. The implemented approach is to assign tags to objects with constructors when a slab is allocated and call constructors once as usual. The downside is that such an object would always have the same tag when it is reallocated, so we won't catch use-after-frees on it. Also preassign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since they can be validly accessed after having been freed. Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2bd926b4 |
|
28-Dec-2018 |
Andrey Konovalov <andreyknvl@google.com> |
kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS This commit splits the current CONFIG_KASAN config option into two: 1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one that exists now); 2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode. The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have another hardware tag-based KASAN mode, that will rely on hardware memory tagging support in arm64. With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to instrument kernel files with -fsanitize=kernel-hwaddress (except the ones for which KASAN_SANITIZE := n is set). Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes. This commit also adds an empty placeholder (for now) implementation of the tag-based KASAN specific hooks inserted by the compiler and adjusts the common hooks implementation. While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will enable once all the infrastructure code has been added. Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
12b22386 |
|
28-Dec-2018 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, slub: handle pointer tags in early_kmem_cache_node_alloc The previous patch updated KASAN hooks signatures and their usage in SLAB and SLUB code, except for the early_kmem_cache_node_alloc function. This patch handles that function separately, as it requires reordering some of the initialization code to correctly propagate a tagged pointer in case a tag is assigned by kasan_kmalloc. Link: http://lkml.kernel.org/r/fc8d0fdcf733a7a52e8d0daaa650f4736a57de8c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0116523c |
|
28-Dec-2018 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, mm: change hooks signatures Patch series "kasan: add software tag-based mode for arm64", v13. This patchset adds a new software tag-based mode to KASAN [1]. (Initially this mode was called KHWASAN, but it got renamed, see the naming rationale at the end of this section). The plan is to implement HWASan [2] for the kernel with the incentive, that it's going to have comparable to KASAN performance, but in the same time consume much less memory, trading that off for somewhat imprecise bug detection and being supported only for arm64. The underlying ideas of the approach used by software tag-based KASAN are: 1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store pointer tags in the top byte of each kernel pointer. 2. Using shadow memory, we can store memory tags for each chunk of kernel memory. 3. On each memory allocation, we can generate a random tag, embed it into the returned pointer and set the memory tags that correspond to this chunk of memory to the same value. 4. By using compiler instrumentation, before each memory access we can add a check that the pointer tag matches the tag of the memory that is being accessed. 5. On a tag mismatch we report an error. With this patchset the existing KASAN mode gets renamed to generic KASAN, with the word "generic" meaning that the implementation can be supported by any architecture as it is purely software. The new mode this patchset adds is called software tag-based KASAN. The word "tag-based" refers to the fact that this mode uses tags embedded into the top byte of kernel pointers and the TBI arm64 CPU feature that allows to dereference such pointers. The word "software" here means that shadow memory manipulation and tag checking on pointer dereference is done in software. As it is the only tag-based implementation right now, "software tag-based" KASAN is sometimes referred to as simply "tag-based" in this patchset. 
A potential expansion of this mode is a hardware tag-based mode, which would use hardware memory tagging support (announced by Arm [3]) instead of compiler instrumentation and manual shadow memory manipulation. Same as generic KASAN, software tag-based KASAN is strictly a debugging feature. [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a ====== Rationale On mobile devices generic KASAN's memory usage is significant problem. One of the main reasons to have tag-based KASAN is to be able to perform a similar set of checks as the generic one does, but with lower memory requirements. Comment from Vishwath Mohan <vishwath@google.com>: I don't have data on-hand, but anecdotally both ASAN and KASAN have proven problematic to enable for environments that don't tolerate the increased memory pressure well. This includes (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go, (c) Connected components like Pixel's visual core [1]. These are both places I'd love to have a low(er) memory footprint option at my disposal. Comment from Evgenii Stepanov <eugenis@google.com>: Looking at a live Android device under load, slab (according to /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's overhead of 2x - 3x on top of it is not insignificant. Not having this overhead enables near-production use - ex. running KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do not reproduce in test configuration. These are the ones that often cost the most engineering time to track down. CPU overhead is bad, but generally tolerable. RAM is critical, in our experience. Once it gets low enough, OOM-killer makes your life miserable. 
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/ ====== Technical details Software tag-based KASAN mode is implemented in a very similar way to the generic one. This patchset essentially does the following: 1. TCR_TBI1 is set to enable Top Byte Ignore. 2. Shadow memory is used (with a different scale, 1:16, so each shadow byte corresponds to 16 bytes of kernel memory) to store memory tags. 3. All slab objects are aligned to shadow scale, which is 16 bytes. 4. All pointers returned from the slab allocator are tagged with a random tag and the corresponding shadow memory is poisoned with the same value. 5. Compiler instrumentation is used to insert tag checks, either by calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE flags are reused). 6. When a tag mismatch is detected in callback instrumentation mode KASAN simply prints a bug report. In case of inline instrumentation, clang inserts a brk instruction, and KASAN has its own brk handler, which reports the bug. 7. The memory in between slab objects is marked with a reserved tag, and acts as a redzone. 8. When a slab object is freed it's marked with a reserved tag. Bug detection is imprecise for two reasons: 1. We won't catch some small out-of-bounds accesses that fall into the same shadow cell as the last byte of a slab object. 2. We only have 1 byte to store tags, which means we have a 1/256 probability of a tag match for an incorrect access (actually even slightly less due to reserved tag values). Despite that, there is a particular type of bug that tag-based KASAN can detect and generic KASAN cannot: use-after-free after the object has been allocated by someone else. ====== Testing Some kernel developers voiced a concern that changing the top byte of kernel pointers may lead to subtle bugs that are difficult to discover. To address this concern deliberate testing has been performed. 
It doesn't seem feasible to do some kind of static checking to find potential issues with pointer tagging, so a dynamic approach was taken. All pointer comparisons/subtractions have been instrumented in an LLVM compiler pass and a kernel module that would print a bug report whenever two pointers with different tags are being compared/subtracted (ignoring comparisons with NULL pointers and with pointers obtained by casting an error code to a pointer type) has been used. Then the kernel has been booted in QEMU and on an Odroid C2 board and syzkaller has been run. This yielded the following results. The two places that look interesting are: is_vmalloc_addr in include/linux/mm.h is_kernel_rodata in mm/util.c Here we compare a pointer with some fixed untagged values to make sure that the pointer lies in a particular part of the kernel address space. Since tag-based KASAN doesn't add tags to pointers that belong to rodata or vmalloc regions, this should work as is. To make sure debug checks to those two functions that check that the result doesn't change whether we operate on pointers with or without untagging has been added. A few other cases that don't look that interesting: Comparing pointers to achieve unique sorting order of pointee objects (e.g. sorting locks addresses before performing a double lock): tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c pipe_double_lock in fs/pipe.c unix_state_double_lock in net/unix/af_unix.c lock_two_nondirectories in fs/inode.c mutex_lock_double in kernel/events/core.c ep_cmp_ffd in fs/eventpoll.c fsnotify_compare_groups fs/notify/mark.c Nothing needs to be done here, since the tags embedded into pointers don't change, so the sorting order would still be unique. 
Checks that a pointer belongs to some particular allocation: is_sibling_entry in lib/radix-tree.c object_is_on_stack in include/linux/sched/task_stack.h Nothing needs to be done here either, since two pointers can only belong to the same allocation if they have the same tag. Overall, since the kernel boots and works, there are no critical bugs. As for the rest, the traditional kernel testing way (use until fails) is the only one that looks feasible. Another point here is that tag-based KASAN is available under a separate config option that needs to be deliberately enabled. Even though it might be used in a "near-production" environment to find bugs that are not found during fuzzing or running tests, it is still a debug tool. ====== Benchmarks The following numbers were collected on an Odroid C2 board. Both generic and tag-based KASAN were used in inline instrumentation mode. Boot time [1]: * ~1.7 sec for clean kernel * ~5.0 sec for generic KASAN * ~5.0 sec for tag-based KASAN Network performance [2]: * 8.33 Gbits/sec for clean kernel * 3.17 Gbits/sec for generic KASAN * 2.85 Gbits/sec for tag-based KASAN Slab memory usage after boot [3]: * ~40 kb for clean kernel * ~105 kb (~260% overhead) for generic KASAN * ~47 kb (~20% overhead) for tag-based KASAN KASAN memory overhead consists of three main parts: 1. Increased slab memory usage due to redzones. 2. Shadow memory (reserved in whole once during boot). 3. Quarantine (grows gradually until some preset limit; the higher the limit, the higher the chance to detect a use-after-free). Comparing tag-based vs generic KASAN for each of these points: 1. 20% vs 260% overhead. 2. 1/16th vs 1/8th of physical memory. 3. Tag-based KASAN doesn't require quarantine. [1] Time before the ext4 driver is initialized. [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`. [3] Measured as `cat /proc/meminfo | grep Slab`. ====== Some notes A few notes: 1. The patchset can be found here: https://github.com/xairy/kasan-prototype/tree/khwasan 2. 
Building requires a recent Clang version (7.0.0 or later). 3. Stack instrumentation is not supported yet and will be added later. This patch (of 25): Tag-based KASAN changes the value of the top byte of pointers returned from the kernel allocation functions (such as kmalloc). This patch updates KASAN hooks signatures and their usage in SLAB and SLUB code to reflect that. Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
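The tagging scheme described in the "Technical details" part of this message can be modelled in a few lines of user-space C. This is a toy sketch under stated assumptions: "addresses" are offsets into a 256-byte model memory, the tag lives in bits 56..63 of a 64-bit value, tags are passed in explicitly instead of being random, and `TAG_INVALID`, `model_alloc`, `model_free` and `access_ok` are hypothetical names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define GRANULE     16    /* bytes of memory covered by one shadow byte */
#define MEM_SIZE    256   /* size of the toy address space */
#define TAG_SHIFT   56
#define TAG_INVALID 0xfe  /* assumed reserved tag for freed memory */

/* One memory tag per 16-byte granule (the "shadow memory"). */
static uint8_t shadow[MEM_SIZE / GRANULE];

static uint64_t set_tag(uint64_t addr, uint8_t tag)
{
    return (addr & ~(0xffULL << TAG_SHIFT)) | ((uint64_t)tag << TAG_SHIFT);
}

static uint8_t get_tag(uint64_t ptr)
{
    return (uint8_t)(ptr >> TAG_SHIFT);
}

static uint64_t strip_tag(uint64_t ptr)
{
    return ptr & ~(0xffULL << TAG_SHIFT);
}

/* Colour the granules of [offset, offset + size) and tag the pointer. */
static uint64_t model_alloc(uint64_t offset, size_t size, uint8_t tag)
{
    for (uint64_t a = offset; a < offset + size; a += GRANULE)
        shadow[a / GRANULE] = tag;
    return set_tag(offset, tag);
}

/* Mark the granules of a freed object with the reserved tag. */
static void model_free(uint64_t ptr, size_t size)
{
    uint64_t offset = strip_tag(ptr);

    for (uint64_t a = offset; a < offset + size; a += GRANULE)
        shadow[a / GRANULE] = TAG_INVALID;
}

/* The check that instrumentation would insert before each access. */
static int access_ok(uint64_t ptr)
{
    return get_tag(ptr) == shadow[strip_tag(ptr) / GRANULE];
}
```

In this model a 40-byte object occupies three granules, so an access a few bytes past its end still passes the check, while an access into the next granule fails. A stale pointer fails both after the free and after the memory is handed out again under a different tag.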
|
#
cc252eae |
|
26-Oct-2018 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slab: combine kmalloc_caches and kmalloc_dma_caches Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for a misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64-byte regular allocation with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming all dcache names, and the process can thus "steal" the whole page with a single 64-byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include a removed branch for detecting __GFP_DMA kmalloc(), and shortened kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is what /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... 
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently maintain a separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow adding kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
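The two-dimensional layout can be sketched as follows. The flag values, the enum names and the table contents are assumptions for illustration; they are not the kernel's gfp flags or its cache array. The point is that the cache "type" is computed arithmetically from the flags and used as the outer index, so the lookup itself needs no `if`.

```c
/* Hypothetical cache types: the outer dimension of the table. */
enum cache_type {
    CACHE_NORMAL = 0,
    CACHE_RECLAIM,
    CACHE_DMA,
    NR_CACHE_TYPES
};

/* Hypothetical allocation flags (not the kernel's gfp values). */
#define FLAG_DMA         0x01u
#define FLAG_RECLAIMABLE 0x10u

#define NR_SIZE_CLASSES 14

/* One slot per (type, size class); stands in for the cache pointers. */
static int cache_table[NR_CACHE_TYPES][NR_SIZE_CLASSES];

/*
 * Map flags to a cache type without branching. If both flags are
 * set, DMA wins and the reclaimable hint is ignored.
 */
static enum cache_type cache_type_for(unsigned int flags)
{
    int is_dma = !!(flags & FLAG_DMA);
    int is_reclaimable = !!(flags & FLAG_RECLAIMABLE);

    return (enum cache_type)(is_dma * CACHE_DMA +
                             (is_reclaimable & !is_dma) * CACHE_RECLAIM);
}

/* Look up the slot for a given flag set and size class. */
static int *cache_slot(unsigned int flags, unsigned int size_class)
{
    return &cache_table[cache_type_for(flags)][size_class];
}
```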
|
#
c5fd3ca0 |
|
26-Oct-2018 |
Aaron Tomlin <atomlin@redhat.com> |
slub: extend slub debug to handle multiple slabs Extend the slub_debug syntax to "slub_debug=<flags>[,<slub>]*", where <slub> may contain an asterisk at the end. For example, the following would poison all kmalloc slabs: slub_debug=P,kmalloc* and the following would apply the default flags to all kmalloc and all block IO slabs: slub_debug=,bio*,kmalloc* Please note that a similar patch was posted by Iliyan Malchev some time ago but was never merged: https://marc.info/?l=linux-mm&m=131283905330474&w=2 Link: http://lkml.kernel.org/r/20180928111139.27962-1-atomlin@redhat.com Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Iliyan Malchev <malchev@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
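As a rough illustration of the matching rule (not the kernel's parser), the sketch below takes the comma-separated `<slub>` list that follows the flags and checks whether a cache name matches any entry, treating a trailing asterisk as "prefix match". The function name `slab_list_matches` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if name matches any entry of the comma-separated list.
 * An entry ending in '*' matches every name that starts with the
 * text before the asterisk; other entries must match exactly.
 */
static int slab_list_matches(const char *list, const char *name)
{
    const char *p = list;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        size_t cmp_len = len;
        int glob = 0;

        if (len > 0 && p[len - 1] == '*') {
            glob = 1;
            cmp_len = len - 1;
        }

        /* Empty entries (e.g. from ",,") never match. */
        if (len > 0 && strncmp(name, p, cmp_len) == 0 &&
            (glob || name[cmp_len] == '\0'))
            return 1;

        if (!end)
            break;
        p = end + 1;
    }
    return 0;
}
```

For the second example in the message, `slub_debug=,bio*,kmalloc*`, the list part would be `bio*,kmalloc*`, which matches `bio-0` and `kmalloc-64` but not `dentry`.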
|
#
0684e652 |
|
26-Oct-2018 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
mm/slub.c: switch to bitmap_zalloc() Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Link: http://lkml.kernel.org/r/20180830104301.61649-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
13ba17be |
|
24-Aug-2018 |
Mukesh Ojha <mojha@codeaurora.org> |
notifier: Remove notifier header file wherever not used The conversion of the hotplug notifiers to a state machine left the notifier.h includes around in some places. Remove them. Signed-off-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
|
#
0882ff91 |
|
17-Aug-2018 |
Vlastimil Babka <vbabka@suse.cz> |
mm, slub: restore the original intention of prefetch_freepointer() In SLUB, prefetch_freepointer() is used when allocating an object from cache's freelist, to make sure the next object in the list is cache-hot, since it's probable it will be allocated soon. Commit 2482ddec670f ("mm: add SLUB free list pointer obfuscation") has unintentionally changed the prefetch in a way where the prefetch is turned to a real fetch, and only the next->next pointer is prefetched. In case there is not a stream of allocations that would benefit from prefetching, the extra real fetch might add a useless cache miss to the allocation. Restore the previous behavior. Link: http://lkml.kernel.org/r/20180809085245.22448-1-vbabka@suse.cz Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d50d82fa |
|
28-Jun-2018 |
Mikulas Patocka <mpatocka@redhat.com> |
slub: fix failure when we delete and create a slab cache In kernel 4.17 I removed some code from dm-bufio that did slab cache merging (commit 21bb13276768: "dm bufio: remove code that merges slab caches") - both slab and slub support merging caches with identical attributes, so dm-bufio now just calls kmem_cache_create and relies on implicit merging. This uncovered a bug in the slub subsystem - if we delete a cache and immediately create another cache with the same attributes, it fails because of duplicate filename in /sys/kernel/slab/. The slub subsystem offloads freeing the cache to a workqueue - and if we create the new cache before the workqueue runs, it complains because of duplicate filename in sysfs. This patch fixes the bug by moving the call of kobject_del from sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called while we hold slab_mutex - so that the sysfs entry is deleted before a cache with the same attributes could be created. Running device-mapper-test-suite with: dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/ triggered: Buffer I/O error on dev dm-0, logical block 1572848, async page read device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5 device-mapper: thin: 253:1: aborting current metadata transaction sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144' CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25 Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017 Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: dump_stack+0x5a/0x73 sysfs_warn_dup+0x58/0x70 sysfs_create_dir_ns+0x77/0x80 kobject_add_internal+0xba/0x2e0 kobject_init_and_add+0x70/0xb0 sysfs_slab_add+0xb1/0x250 __kmem_cache_create+0x116/0x150 create_cache+0xd9/0x1f0 kmem_cache_create_usercopy+0x1c1/0x250 kmem_cache_create+0x18/0x20 dm_bufio_client_create+0x1ae/0x410 [dm_bufio] dm_block_manager_create+0x5e/0x90 [dm_persistent_data] 
__create_persistent_data_objects+0x38/0x940 [dm_thin_pool] dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool] metadata_operation_failed+0x59/0x100 [dm_thin_pool] alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool] process_cell+0x2a3/0x550 [dm_thin_pool] do_worker+0x28d/0x8f0 [dm_thin_pool] process_one_work+0x171/0x370 worker_thread+0x49/0x3f0 kthread+0xf8/0x130 ret_from_fork+0x35/0x40 kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory. kmem_cache_create(dm_bufio_buffer-16) failed with error -17 Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6396bb22 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a, b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kcalloc(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) 
| - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) 
| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) 
| - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
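The motivation for the two-argument form is that the multiplication can be checked for overflow inside the allocator. A minimal userspace sketch of that idea follows; `kcalloc_sketch()` is a hypothetical name, `malloc()` stands in for the kernel allocator, and `__builtin_mul_overflow()` is the GCC/Clang builtin rather than the kernel's own helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of the given size, refusing the
 * request when n * size does not fit in size_t. An open-coded
 * malloc(n * size) would silently wrap around and return a buffer that
 * is far too small.
 */
static void *kcalloc_sketch(size_t n, size_t size)
{
    size_t bytes;
    void *p;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL; /* product overflowed: fail instead of under-allocating */

    p = malloc(bytes);
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}
```

With this shape, a call such as `kcalloc_sketch(count, sizeof(struct foo))` keeps the count and the element size as separate arguments, which is what the Coccinelle rules above rewrite call sites into.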
|
#
6da2ec56 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a, b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) 
| - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) 
| kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) 
| - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
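For the three-factor case the commit relies on array3_size(), which saturates instead of wrapping. The sketch below shows that saturating behaviour in userspace C. `array3_size_sketch()` is a hypothetical stand-in, and it assumes, as the kernel helper does, that `SIZE_MAX` is a size no allocator will ever satisfy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Multiply three size factors. If any intermediate product overflows,
 * return SIZE_MAX so that the allocation that follows fails cleanly
 * rather than succeeding with a wrapped, undersized length.
 */
static size_t array3_size_sketch(size_t a, size_t b, size_t c)
{
    size_t bytes;

    if (__builtin_mul_overflow(a, b, &bytes) ||
        __builtin_mul_overflow(bytes, c, &bytes))
        return SIZE_MAX;
    return bytes;
}
```

A caller would pass the result straight to the single-size allocator, for example `malloc(array3_size_sketch(rows, cols, sizeof(int)))` in userspace.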
|
#
325d7d4a |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
slub: remove 'reserved' file from sysfs Christoph doubts anyone was using the 'reserved' file in sysfs, so remove it. Link: http://lkml.kernel.org/r/20180518194519.3820-17-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9736d2a9 |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
slub: remove kmem_cache->reserved The reserved field was only used for embedding an rcu_head in the data structure. With the previous commit, we no longer need it. That lets us remove the 'reserved' argument to a lot of functions. Link: http://lkml.kernel.org/r/20180518194519.3820-16-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bf68c214 |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
slab,slub: remove rcu_head size checks rcu_head may now grow larger than list_head without affecting slab or slub. Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b7ccc7f8 |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
mm: move lru union within struct page Since the LRU is two words, this does not affect the double-word alignment of SLUB's freelist. Link: http://lkml.kernel.org/r/20180518194519.3820-10-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7d27a04b |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
mm: move 'private' union within struct page By moving page->private to the fourth word of struct page, we can put the SLUB counters in the same word as SLAB's s_mem and still do the cmpxchg_double trick. Now the SLUB counters no longer overlap with the mapcount or refcount so we can drop the call to page_mapcount_reset() and simplify set_page_slub_counters() to a single line. Link: http://lkml.kernel.org/r/20180518194519.3820-6-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d4fc5069 |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
mm: switch s_mem and slab_cache in struct page This will allow us to store slub's counters in the same bits as slab's s_mem. slub now needs to set page->mapping to NULL as it frees the page, just like slab does. Link: http://lkml.kernel.org/r/20180518194519.3820-5-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
05088e5d |
|
07-Jun-2018 |
Canjiang Lu <canjiang.lu@samsung.com> |
mm/slub: remove obsolete comment The obsolete comment removed in this patch was introduced by 51df1142816e4 ("slub: Dynamically size kmalloc cache allocations"). I paste the related modification from that commit: +#ifdef CONFIG_NUMA + /* + * Allocate kmem_cache_node properly from the kmem_cache slab. + * kmem_cache_node is separately allocated so no need to + * update any list pointers. + */ + temp_kmem_cache_node = kmem_cache_node; + kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT); + memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size); + + kmem_cache_bootstrap_fixup(kmem_cache_node); + + caches++; +#else + /* + * kmem_cache has kmem_cache_node embedded and we moved it! + * Update the list heads + */ + INIT_LIST_HEAD(&kmem_cache->local_node.partial); + list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial); +#ifdef CONFIG_SLUB_DEBUG + INIT_LIST_HEAD(&kmem_cache->local_node.full); + list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full); +#endif As we can see, the comments were used to distinguish the different handling of NUMA and non-NUMA configurations in the original commit. I think the comment doesn't make any sense in the current implementation, where it is placed above kmem_cache_node = bootstrap(&boot_kmem_cache_node); So maybe it's better to remove it now? Link: http://lkml.kernel.org/r/5af26f58.1c69fb81.1be0e.c520SMTPIN_ADDED_BROKEN@mx.google.com Signed-off-by: Canjiang Lu <canjiang.lu@samsung.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a38965bf |
|
07-Jun-2018 |
Mathieu Malaterre <malat@debian.org> |
mm/slub.c: add __printf verification to slab_err() __printf is useful to verify format and arguments. Remove the following warning (with W=1): mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format] Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
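What the annotation buys can be shown with a small userspace sketch. `PRINTF_SKETCH` and `slab_err_sketch()` are hypothetical names; the underlying `format(printf, ...)` attribute is the GCC/Clang function attribute that the kernel's `__printf` macro expands to.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for the kernel's __printf(a, b): argument a is the format
 * string, argument b is the first variadic argument to check against it. */
#define PRINTF_SKETCH(a, b) __attribute__((format(printf, a, b)))

/*
 * Format an error message into buf. Because of the attribute, a call like
 * slab_err_sketch(buf, len, "%s", 42) is diagnosed at compile time under
 * -Wformat instead of misbehaving at run time.
 */
static PRINTF_SKETCH(3, 4) int slab_err_sketch(char *buf, size_t len,
                                               const char *fmt, ...)
{
    va_list args;
    int n;

    assert(buf != NULL && len > 0);
    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);
    va_end(args);
    return n;
}
```

The attribute changes nothing at run time; it only gives the compiler enough information to type-check the variadic arguments.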
|
#
128227e7 |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
slab: __GFP_ZERO is incompatible with a constructor __GFP_ZERO requests that the object be initialised to all-zeroes, while the purpose of a constructor is to initialise an object to a particular pattern. We cannot do both. Add a warning to catch any users who mistakenly pass a __GFP_ZERO flag when allocating a slab with a constructor. Link: http://lkml.kernel.org/r/20180412191322.GA21205@bombadil.infradead.org Fixes: d07dbea46405 ("Slab allocators: support __GFP_ZERO in all allocators") Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
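The condition being warned about is a simple conjunction, sketched here in userspace C. All names are hypothetical: `ZERO_FLAG_SKETCH` stands in for `__GFP_ZERO`, and `struct ctor_cache_sketch` keeps only the fields needed for the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ZERO_FLAG_SKETCH 0x1u /* hypothetical stand-in for __GFP_ZERO */

/* Minimal cache description: object size plus an optional constructor. */
struct ctor_cache_sketch {
    size_t object_size;
    void (*ctor)(void *obj);
};

/* Example constructor: stamps a non-zero pattern into the object. */
static void mark_first_byte_ctor_sketch(void *obj)
{
    *(unsigned char *)obj = 0x5a;
}

/*
 * True when the request is contradictory: the caller asks for an
 * all-zero object from a cache whose constructor is supposed to leave
 * the object in a specific, generally non-zero, initial state.
 */
static bool zero_conflicts_with_ctor_sketch(const struct ctor_cache_sketch *s,
                                            unsigned int flags)
{
    assert(s != NULL);
    return s->ctor != NULL && (flags & ZERO_FLAG_SKETCH) != 0;
}
```

An allocator would evaluate this predicate on each allocation and emit a one-time warning when it is true.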
|
#
c3895391 |
|
10-Apr-2018 |
Andrey Konovalov <andreyknvl@google.com> |
kasan, slub: fix handling of kasan_slab_free hook The kasan_slab_free hook's return value denotes whether the reuse of a slab object must be delayed (e.g. when the object is put into memory quarantine). The current way SLUB handles this hook is by ignoring its return value and hardcoding checks similar (but not exactly the same) to the ones performed in kasan_slab_free, which is error-prone. The main difference between the hardcoded checks and the ones in kasan_slab_free is whether we want to perform a free when an invalid-free or a double-free was detected (we don't). This patch changes the way SLUB handles this by: 1. taking into account the return value of kasan_slab_free for each of the objects that are being freed; 2. reconstructing the freelist of objects to exclude the ones whose reuse must be delayed. [andreyknvl@google.com: eliminate unnecessary branch in slab_free] Link: http://lkml.kernel.org/r/a62759a2545fddf69b0c034547212ca1eb1b3ce2.1520359686.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/083f58501e54731203801d899632d76175868e97.1519400992.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
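The two steps can be sketched as a filter over a singly linked list. This is a simplified userspace illustration, not the SLUB code: `struct obj_sketch`, the `delayed` field, and both functions are hypothetical, and the sketch preserves list order purely for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with an embedded free pointer. */
struct obj_sketch {
    struct obj_sketch *next;
    int delayed; /* non-zero: the hook wants reuse of this object delayed */
};

/* Stand-in for the hook: returns true when reuse must be delayed. */
static bool free_hook_sketch(const struct obj_sketch *o)
{
    return o->delayed != 0;
}

/*
 * Step 1: ask the hook about every object being freed.
 * Step 2: rebuild the list from the objects that may be reused at once.
 * Returns the new head; *tail receives the last kept object, or NULL
 * when every object was held back.
 */
static struct obj_sketch *filter_freelist_sketch(struct obj_sketch *head,
                                                 struct obj_sketch **tail)
{
    struct obj_sketch *new_head = NULL, *new_tail = NULL;

    assert(tail != NULL);
    while (head != NULL) {
        struct obj_sketch *next = head->next;

        if (!free_hook_sketch(head)) {
            head->next = NULL;
            if (new_tail != NULL)
                new_tail->next = head;
            else
                new_head = head;
            new_tail = head;
        }
        head = next;
    }
    *tail = new_tail;
    return new_head;
}
```

When the returned head is `NULL`, nothing is left to hand back to the allocator, which is the case where the free path can return early.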
|
#
f9e13c0a |
|
05-Apr-2018 |
Shakeel Butt <shakeelb@google.com> |
slab, slub: skip unnecessary kasan_cache_shutdown() The kasan quarantine is designed to delay freeing slab objects to catch use-after-free. The quarantine can be large (several percent of machine memory size). When kmem_caches are deleted related objects are flushed from the quarantine but this requires scanning the entire quarantine which can be very slow. We have seen the kernel busily working on this while holding slab_mutex and badly affecting cache_reaper, slabinfo readers and memcg kmem cache creations. It can easily be reproduced by the following script: yes . | head -1000000 | xargs stat > /dev/null for i in `seq 1 10`; do seq 500 | (cd /cg/memory && xargs mkdir) seq 500 | xargs -I{} sh -c 'echo $BASHPID > \ /cg/memory/{}/tasks && exec stat .' > /dev/null seq 500 | (cd /cg/memory && xargs rmdir) done The busy stack: kasan_cache_shutdown shutdown_cache memcg_destroy_kmem_caches mem_cgroup_css_free css_free_rwork_fn process_one_work worker_thread kthread ret_from_fork This patch is based on the observation that if the kmem_cache to be destroyed is empty then there should not be any objects of this cache in the quarantine. Without the patch the script got stuck for a couple of hours. With the patch the script completed within a second. Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
870b1fbb |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make size_from_object() return unsigned int Function returns size of the object without red zone which can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-24-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
19af27af |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make struct kmem_cache_order_objects::x unsigned int struct kmem_cache_order_objects is for mixing order and number of objects, and orders aren't big enough to warrant 64-bit width. Propagate unsignedness down so that everything fits. !!! Patch assumes that "PAGE_SIZE << order" doesn't overflow. !!! Link: http://lkml.kernel.org/r/20180305200730.15812-23-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
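How an order and an object count share one `unsigned int` can be sketched as follows. The 16-bit split mirrors the `OO_SHIFT`/`OO_MASK` scheme in mm/slub.c, but the `_sketch` names and the simplified constructor signature (taking the object count directly) are assumptions made for illustration.

```c
#include <assert.h>

#define OO_SHIFT_SKETCH 16
#define OO_MASK_SKETCH  ((1u << OO_SHIFT_SKETCH) - 1)

/* One 32-bit word: page order in the high half, object count in the low half. */
struct oo_sketch {
    unsigned int x;
};

/* Pack an order and an object count; the count must fit in 16 bits. */
static struct oo_sketch oo_make_sketch(unsigned int order, unsigned int objects)
{
    struct oo_sketch oo;

    assert(objects <= OO_MASK_SKETCH);
    assert(order <= OO_MASK_SKETCH);
    oo.x = (order << OO_SHIFT_SKETCH) + objects;
    return oo;
}

/* Extract the page order. */
static unsigned int oo_order_sketch(struct oo_sketch oo)
{
    return oo.x >> OO_SHIFT_SKETCH;
}

/* Extract the number of objects per slab. */
static unsigned int oo_objects_sketch(struct oo_sketch oo)
{
    return oo.x & OO_MASK_SKETCH;
}
```

Since both halves fit comfortably in 16 bits, a 64-bit `unsigned long` carries no extra information, which is the point the commit makes.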
|
#
284b50dd |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make slab_index() return unsigned int slab_index() returns index of an object within a slab which is at most u15 (or u16?). Iterators additionally guarantee that "p >= addr". Link: http://lkml.kernel.org/r/20180305200730.15812-22-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
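The index computation is a pointer difference divided by the per-object stride, which is why the result is never negative once `p >= addr` holds. A userspace sketch, with a hypothetical function name and the stride passed directly instead of being read from the cache:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Index of the object at p inside the slab starting at addr, where each
 * object occupies `size` bytes. The preconditions make the quotient
 * non-negative, so an unsigned return type loses nothing.
 */
static unsigned int slab_index_sketch(const void *p, const void *addr,
                                      unsigned int size)
{
    ptrdiff_t delta;

    assert(size != 0);
    assert((const char *)p >= (const char *)addr);
    delta = (const char *)p - (const char *)addr;
    return (unsigned int)((size_t)delta / size);
}
```

For a slab of 64-byte objects, the object starting 128 bytes into the slab has index 2.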
|
#
7bbdb81e |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slab: make usercopy region 32-bit If kmem cache sizes are 32-bit, then the usercopy region should be too. Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
be4a7988 |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
kasan: make kasan_cache_create() work with 32-bit slab cache sizes If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should not do it as well. Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0293d1fd |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slab: make kmem_cache_flags accept 32-bit object size Now that all sizes are properly typed, propagate "unsigned int" down the callgraph. Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
44065b2e |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->size unsigned int Linux doesn't support negative length objects (including meta data). Link: http://lkml.kernel.org/r/20180305200730.15812-18-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b473f29 |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->object_size unsigned int Linux doesn't support negative length objects. Link: http://lkml.kernel.org/r/20180305200730.15812-17-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e5d9998f |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->cpu_partial unsigned int /* * cpu_partial determined the maximum number of objects * kept in the per cpu partial lists of a processor. */ Can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52ee6d74 |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->inuse unsigned int ->inuse is "the number of bytes in actual use by the object", can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-14-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3a3791ec |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->align unsigned int Kmem cache alignment can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-13-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d66e52d1 |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->reserved unsigned int ->reserved is either 0 or sizeof(struct rcu_head), can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-12-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
eb7235eb |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: make ->remote_node_defrag_ratio unsigned int ->remote_node_defrag_ratio is in range 0..1000. This also adds a check and modifies the behavior to return an error code. Before this patch invalid values were ignored. Link: http://lkml.kernel.org/r/20180305200730.15812-9-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
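
The behavior change described above (reject an out-of-range value instead of silently ignoring it) can be sketched as follows. This is a minimal illustration, not the kernel code: the struct, the function name, and the choice of -ERANGE as the error value are assumptions made for the example; only the 0..1000 range comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the cache field touched by the patch. */
struct defrag_ratio_sketch {
	unsigned int remote_node_defrag_ratio;	/* valid range: 0..1000 */
};

/*
 * Store a new ratio. Out-of-range input is reported to the caller
 * (the error code here is an assumption) and the old value is kept.
 */
static int set_defrag_ratio(struct defrag_ratio_sketch *c, unsigned int ratio)
{
	assert(c != NULL);
	if (ratio > 1000)
		return -ERANGE;
	c->remote_node_defrag_ratio = ratio;
	return 0;
}
```

Because the field is unsigned, only the upper bound needs checking; a negative input can no longer be represented at all.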
|
#
f4957d5b |
|
05-Apr-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
slab: make kmem_cache_create() work with 32-bit sizes struct kmem_cache::size and ::align were always 32-bit. Out of curiosity I created 4GB kmem_cache, it oopsed with division by 0. kmem_cache_create(1UL<<32+1) created 1-byte cache as expected. size_t doesn't work and never did. Link: http://lkml.kernel.org/r/20180305200730.15812-6-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
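
The truncation described above can be reproduced in user space. The sketch below assumes the intended request was (1UL<<32)+1; as literally written, `1UL<<32+1` parses as `1UL << 33`, because `+` binds tighter than `<<` in C. The struct and function names are hypothetical and only model a 32-bit size field receiving a 64-bit request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cache descriptor with a 32-bit size field. */
struct cache_size_sketch {
	uint32_t size;
};

/* Store a 64-bit request into the 32-bit field; only the low 32 bits survive. */
static void cache_set_size(struct cache_size_sketch *c, uint64_t requested)
{
	assert(c != NULL);
	c->size = (uint32_t)requested;
}
```

A request of exactly 4GB leaves a stored size of 0, which is consistent with the division-by-zero oops the message mentions, while 4GB+1 leaves a stored size of 1.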
|
#
86609d33 |
|
05-Apr-2018 |
Chintan Pandya <cpandya@codeaurora.org> |
mm/slub.c: use jitter-free reference while printing age When SLUB_DEBUG catches some issues, it prints all the required debug info. However, in a few cases where allocation and free of the object has happened in a very short time, 'age' might be misleading. See the example below: ============================================================================= BUG kmalloc-256 (Tainted: G W O ): Poison overwritten ----------------------------------------------------------------------------- ... INFO: Allocated in binder_transaction+0x4b0/0x2448 age=731 cpu=3 pid=5314 ... INFO: Freed in binder_free_transaction+0x2c/0x58 age=735 cpu=6 pid=2079 ... Object fffffff14956a870: 6b 6b 6b 6b 6b 6b 6b 6b 67 6b 6b 6b 6b 6b 6b a5 kkkkkkkkgkkkk In this case, object got freed later but 'age' shows otherwise. This could be because, while printing this info, we print allocation traces first and free traces thereafter. In between, if we get scheduled out or jiffies increments, (jiffies - t->when) could become meaningless. Use a jitter-free reference to calculate age. The output format stays exactly the same; 'age' in both prints is now computed from a single jiffies reference. Change-Id: I0846565807a4229748649bbecb1ffb743d71fcd8 Link: http://lkml.kernel.org/r/1520492010-19389-1-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
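
The idea of a single time reference can be sketched as below. The struct and function names are hypothetical stand-ins, not the kernel's own; the point is only that the caller samples the tick counter once and passes the same value to both prints, so the two ages stay mutually consistent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-object allocation/free record. */
struct track_sketch {
	unsigned long when;	/* tick count when the event was recorded */
};

/*
 * Age of an event relative to a reference tick count that the caller
 * sampled once, rather than re-reading the clock for every print.
 */
static unsigned long track_age(const struct track_sketch *t, unsigned long now)
{
	assert(t != NULL);
	return now - t->when;
}
```

With one shared reference, an object that was freed after it was allocated always shows a smaller age for the free than for the allocation, whatever delay occurs between the two prints.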
|
#
ee3ce779 |
|
06-Feb-2018 |
Dmitry Vyukov <dvyukov@google.com> |
kasan: don't use __builtin_return_address(1) __builtin_return_address(1) is unreliable without frame pointers. With defconfig on kmalloc_pagealloc_invalid_free test I am getting: BUG: KASAN: double-free or invalid-free in (null) Pass caller PC from callers explicitly. Link: http://lkml.kernel.org/r/9b01bc2d237a4df74ff8472a3bf6b7635908de01.1514378558.git.dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
47adccce |
|
06-Feb-2018 |
Dmitry Vyukov <dvyukov@google.com> |
kasan: detect invalid frees for large objects Patch series "kasan: detect invalid frees". KASAN detects double-frees, but does not detect invalid-frees (when a pointer into the middle of a heap object is passed to free). We recently had a very unpleasant case in crypto code which freed an inner object inside of a heap allocation. This went unnoticed during free, but totally corrupted the heap and later led to a bunch of random crashes all over kernel code. Detect invalid frees. This patch (of 5): Detect frees of pointers into the middle of large heap objects. I dropped const from kasan_kfree_large() because it starts propagating through a bunch of functions in kasan_report.c, slab/slub nearest_obj(), all of their local variables, fixup_red_left(), etc. Link: http://lkml.kernel.org/r/1b45b4fe1d20fc0de1329aab674c1dd973fee723.1514378558.git.dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
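
The notion of an invalid free (a pointer into the middle of an object) can be sketched for a region carved into equally sized objects. This is only an illustration of the check's shape, not KASAN's implementation: the function name, the parameters, and the assumption of a flat array of same-sized objects starting at `base` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true when ptr does not point at the start of an object in a
 * region that begins at base and is divided into obj_size-byte objects.
 * A pointer below the region is also treated as invalid.
 */
static bool free_is_invalid(uintptr_t base, size_t obj_size, uintptr_t ptr)
{
	assert(obj_size != 0);
	if (ptr < base)
		return true;
	return (ptr - base) % obj_size != 0;
}
```

For a large allocation that occupies its own pages, the same idea degenerates to comparing the pointer with the start address of the allocation.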
|
#
0d2d5d40 |
|
31-Jan-2018 |
Miles Chen <miles.chen@mediatek.com> |
slub: remove obsolete comments of put_cpu_partial() Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately") makes put_cpu_partial() run with preemption disabled and interrupts disabled when calling unfreeze_partials(). The comment: "put_cpu_partial() is done without interrupts disabled and without preemption disabled" looks obsolete, so remove it. Link: http://lkml.kernel.org/r/1516968550-1520-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5d682681 |
|
31-Jan-2018 |
Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com> |
mm/slub.c: fix wrong address during slab padding restoration Start address calculated for slab padding restoration was wrong. Wrong address would point to some section before padding and could cause corruption. Link: http://lkml.kernel.org/r/1516604578-4577-1-git-send-email-balasubramani_vivekanandan@mentor.com Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2d891fbc |
|
30-Nov-2017 |
Kees Cook <keescook@chromium.org> |
usercopy: Allow strict enforcement of whitelists This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the behavior of hardened usercopy whitelist violations. By default, whitelist violations will continue to WARN() so that any bad or missing usercopy whitelists can be discovered without being too disruptive. If this config is disabled at build time or a system is booted with "slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead of WARN(). This is useful for admins that want to use usercopy whitelists immediately. Suggested-by: Matthew Garrett <mjg59@google.com> Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
afcc90f8 |
|
10-Jan-2018 |
Kees Cook <keescook@chromium.org> |
usercopy: WARN() on slab cache usercopy region violations This patch adds checking of usercopy cache whitelisting, and is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. The SLAB and SLUB allocators are modified to WARN() on all copy operations in which the kernel heap memory being modified falls outside of the cache's defined usercopy region. Based on an earlier patch from David Windsor. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
8eb8284b |
|
10-Jun-2017 |
David Windsor <dave@nullcore.net> |
usercopy: Prepare for usercopy whitelisting This patch prepares the slab allocator to handle caches having annotations (useroffset and usersize) defining usercopy regions. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass hardened usercopy checks since these sizes cannot change at runtime.) To support this whitelist annotation, usercopy region offset and size members are added to struct kmem_cache. The slab allocator receives a new function, kmem_cache_create_usercopy(), that creates a new cache with a usercopy region defined, suitable for declaring spans of fields within the objects that get copied to/from userspace. In this patch, the default kmem_cache_create() marks the entire allocation as whitelisted, leaving it semantically unchanged. Once all fine-grained whitelists have been added (in subsequent patches), this will be changed to a usersize of 0, making caches created with kmem_cache_create() not copyable to/from userspace. 
After the entire usercopy whitelist series is applied, less than 15% of the slab cache memory remains exposed to potential usercopy bugs after a fresh boot: Total Slab Memory: 48074720 Usercopyable Memory: 6367532 13.2% task_struct 0.2% 4480/1630720 RAW 0.3% 300/96000 RAWv6 2.1% 1408/64768 ext4_inode_cache 3.0% 269760/8740224 dentry 11.1% 585984/5273856 mm_struct 29.1% 54912/188448 kmalloc-8 100.0% 24576/24576 kmalloc-16 100.0% 28672/28672 kmalloc-32 100.0% 81920/81920 kmalloc-192 100.0% 96768/96768 kmalloc-128 100.0% 143360/143360 names_cache 100.0% 163840/163840 kmalloc-64 100.0% 167936/167936 kmalloc-256 100.0% 339968/339968 kmalloc-512 100.0% 350720/350720 kmalloc-96 100.0% 455616/455616 kmalloc-8192 100.0% 655360/655360 kmalloc-1024 100.0% 812032/812032 kmalloc-4096 100.0% 819200/819200 kmalloc-2048 100.0% 1310720/1310720 After some kernel build workloads, the percentage (mainly driven by dentry and inode caches expanding) drops under 10%: Total Slab Memory: 95516184 Usercopyable Memory: 8497452 8.8% task_struct 0.2% 4000/1456000 RAW 0.3% 300/96000 RAWv6 2.1% 1408/64768 ext4_inode_cache 3.0% 1217280/39439872 dentry 11.1% 1623200/14608800 mm_struct 29.1% 73216/251264 kmalloc-8 100.0% 24576/24576 kmalloc-16 100.0% 28672/28672 kmalloc-32 100.0% 94208/94208 kmalloc-192 100.0% 96768/96768 kmalloc-128 100.0% 143360/143360 names_cache 100.0% 163840/163840 kmalloc-64 100.0% 245760/245760 kmalloc-256 100.0% 339968/339968 kmalloc-512 100.0% 350720/350720 kmalloc-96 100.0% 563520/563520 kmalloc-8192 100.0% 655360/655360 kmalloc-1024 100.0% 794624/794624 kmalloc-4096 100.0% 819200/819200 kmalloc-2048 100.0% 1257472/1257472 Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, split out a few extra kmalloc hunks] [kees: add field names to function declarations] [kees: convert BUGs to WARNs and fail closed] [kees: add attack surface reduction analysis to commit log] Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: 
Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Christoph Lameter <cl@linux.com>
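
The whitelist described above reduces to a bounds check of a copy window against the cache's usercopy region. The sketch below is a minimal user-space model under stated assumptions: the struct and function names are hypothetical, and only the two members `useroffset` and `usersize` are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the two whitelist members. */
struct usercopy_cache_sketch {
	unsigned int useroffset;	/* start of the whitelisted region */
	unsigned int usersize;		/* length of the region; 0 means none */
};

/*
 * Return true when the n bytes starting at offset within an object lie
 * entirely inside [useroffset, useroffset + usersize). The subtractions
 * are ordered so that no intermediate value can wrap around.
 */
static bool usercopy_allowed(const struct usercopy_cache_sketch *c,
			     unsigned int offset, unsigned int n)
{
	assert(c != NULL);
	if (offset < c->useroffset)
		return false;
	if (offset - c->useroffset > c->usersize)
		return false;
	return n <= c->usersize - (offset - c->useroffset);
}
```

A cache with `usersize` of 0 rejects every non-empty copy, which matches the stated end goal for caches created with plain kmem_cache_create().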
|
#
f4e6e289 |
|
10-Jan-2018 |
Kees Cook <keescook@chromium.org> |
usercopy: Include offset in hardened usercopy report This refactors the hardened usercopy code so that failure reporting can happen within the checking functions instead of at the top level. This simplifies the return value handling and allows more details and offsets to be included in the report. Having the offset can be much more helpful in understanding hardened usercopy bugs. Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
4675ff05 |
|
15-Nov-2017 |
Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> |
kmemcheck: rip it out Fix up makefiles, remove references, and git rm kmemcheck. Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexander Potapenko <glider@google.com> Cc: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d8be7566 |
|
15-Nov-2017 |
Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> |
kmemcheck: remove whats left of NOTRACK flags Now that kmemcheck is gone, we don't need the NOTRACK flags. Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
75f296d9 |
|
15-Nov-2017 |
Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> |
kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK Convert all allocations that used a NOTRACK flag to stop using it. Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
49502766 |
|
15-Nov-2017 |
Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> |
kmemcheck: remove annotations Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
11066386 |
|
15-Nov-2017 |
Miles Chen <miles.chen@mediatek.com> |
slub: fix sysfs duplicate filename creation when slub_debug=O When slub_debug=O is set, it is possible to clear debug flags for an "unmergeable" slab cache in kmem_cache_open(). This makes the "unmergeable" cache become "mergeable" in sysfs_slab_add(). These caches will generate their "unique IDs" by create_unique_id(), but it is possible to create identical unique IDs. In my experiment, sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and the kernel reports "sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'". To repeat my experiment, set disable_higher_order_debug=1, CONFIG_SLUB_DEBUG_ON=y in kernel-4.14. Fix this issue by setting unmergeable=1 if slub_debug=O and the default slub_debug contains any no-merge flags. call path: kmem_cache_create() __kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here create_cache() __kmem_cache_create() kmem_cache_open() -> clear DEBUG_METADATA_FLAGS sysfs_slab_add() -> the slab cache is mergeable now sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096' ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123 Hardware name: linux,dummy-virt (DT) task: ffffffc07d4e0080 task.stack: ffffff8008008000 PC is at sysfs_warn_dup+0x60/0x7c LR is at sysfs_warn_dup+0x60/0x7c pc : lr : pstate: 60000145 Call trace: sysfs_warn_dup+0x60/0x7c sysfs_create_dir_ns+0x98/0xa0 kobject_add_internal+0xa0/0x294 kobject_init_and_add+0x90/0xb4 sysfs_slab_add+0x90/0x200 __kmem_cache_create+0x26c/0x438 kmem_cache_create+0x164/0x1f4 sg_pool_init+0x60/0x100 do_one_initcall+0x38/0x12c kernel_init_freeable+0x138/0x1d4 kernel_init+0x10/0xfc ret_from_fork+0x10/0x18 Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka
Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
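
Why distinct caches can end up with the same sysfs name can be sketched as below. The layout ":<flag chars>-<7-digit size>" is inferred from the ID ":Ft-0004096" quoted in the message; the function names and the idea of passing the flag characters as a ready-made string are assumptions for illustration, not the kernel's create_unique_id().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an ID from flag characters and object size only. The cache name
 * is not part of the ID, so caches that share flags and size collide.
 */
static int make_cache_id(char *buf, size_t len, const char *flag_chars,
			 unsigned int size)
{
	assert(buf != NULL && flag_chars != NULL);
	return snprintf(buf, len, ":%s-%07u", flag_chars, size);
}

/* Two IDs collide when they are the same string. */
static int ids_collide(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	return strcmp(a, b) == 0;
}
```

Under this model, three 4096-byte caches with the same remaining flags all map to one directory name, so the second registration fails with the duplicate-filename warning shown in the trace.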
|
#
4fd0b46e |
|
15-Nov-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
slab, slub, slob: convert slab_flags_t to 32-bit struct kmem_cache::flags is "unsigned long" which is unnecessary on 64-bit as no flags are defined in the higher bits. Switch the field to 32-bit and save some space on x86_64 until such flags appear: add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657) function old new delta sysfs_slab_add 720 719 -1 ... check_object 699 676 -23 [akpm@linux-foundation.org: fix printk warning] Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d50112ed |
|
15-Nov-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
slab, slub, slob: add slab_flags_t Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON, etc). SLAB is bloated temporarily by switching to "unsigned long", but only temporarily. Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5b365771 |
|
15-Nov-2017 |
Yang Shi <yang.s@alibaba-inc.com> |
mm: slabinfo: remove CONFIG_SLABINFO According to discussion with Christoph (https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like it is pointless to keep CONFIG_SLABINFO around. This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo is still available. [yang.s@alibaba-inc.com: v11] Link: http://lkml.kernel.org/r/1507656303-103845-3-git-send-email-yang.s@alibaba-inc.com Link: http://lkml.kernel.org/r/1507152550-46205-3-git-send-email-yang.s@alibaba-inc.com Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b2441318 |
|
01-Nov-2017 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information in it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). 
- when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there were any new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In the initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
0ee931c4 |
|
13-Sep-2017 |
Michal Hocko <mhocko@suse.com> |
mm: treewide: remove GFP_TEMPORARY allocation flag GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic it becomes much less clear when to use the high-level GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL | __GFP_RECLAIMABLE which in itself is tricky because basically none of the existing callers provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already and especially those with high-level semantics should be clearly defined to prevent confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replacing all existing users to simply use GFP_KERNEL. Please note that SLAB users with shrinkers will still get the __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect short-term allocations but I propose starting from a clear semantic definition and only then adding users with proper justification.
This was brought up before LSF this year by Matthew [1] and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) of its current users. The follow up discussion has revealed that opinions on what might be temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org [akpm@linux-foundation.org: fix typo] [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: drm/i915: fix up] Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9b130ad5 |
|
08-Sep-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
treewide: make "nr_cpu_ids" unsigned First, the number of CPUs can't be a negative number. Second, different signedness leads to suboptimal code in the following cases: 1) kmalloc(nr_cpu_ids * sizeof(X)); "int" has to be sign extended to size_t. 2) while (loff_t *pos < nr_cpu_ids) MOVSXD is 1 byte longer than the same MOV. Other cases exist as well. Basically the compiler is told that nr_cpu_ids can't be negative, which can't be deduced if it is "int". Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function                     old     new   delta
coretemp_cpu_online          450     512     +62
rcu_init_one                1234    1272     +38
pci_device_probe             374     399     +25
...
pgdat_reclaimable_pages      628     556     -72
select_fallback_rq           446     369     -77
task_numa_find_cpu          1923    1807    -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
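The size computation in case 1) can be illustrated with a small user-space sketch. The function names below are hypothetical and are not kernel code; the sketch only shows how a signed versus an unsigned count is converted to size_t before the multiply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not kernel APIs. With a signed count the value
 * must be sign-extended to size_t before the multiply; a negative count
 * silently turns into a huge size.
 */
size_t toy_bytes_signed(int count, size_t elem_size)
{
	return (size_t)count * elem_size;
}

/* With an unsigned count the value is zero-extended instead. */
size_t toy_bytes_unsigned(unsigned int count, size_t elem_size)
{
	return (size_t)count * elem_size;
}
```

For valid counts both helpers return the same size; the difference lies in the conversion the compiler has to emit and in what happens if a negative value ever reaches the signed variant.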
|
#
1fdaaa23 |
|
06-Sep-2017 |
Arvind Yadav <arvind.yadav.cs@gmail.com> |
mm/slub.c: constify attribute_group structures attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/1501157186-3749-1-git-send-email-arvind.yadav.cs@gmail.com Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce6fa91b |
|
06-Sep-2017 |
Alexander Popov <alex.popov@linux.com> |
mm/slub.c: add a naive detection of double free or corruption Add an assertion similar to the "fasttop" check in the GNU C Library allocator as a part of the SLAB_FREELIST_HARDENED feature. An object added to a singly linked freelist should not point to itself. That helps to detect some double free errors (e.g. CVE-2017-2636) without slub_debug and KASAN. Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com Signed-off-by: Alexander Popov <alex.popov@linux.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul E McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tycho Andersen <tycho@docker.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
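A minimal user-space sketch of the idea, assuming a toy freelist whose free objects store the next pointer in their first word. The names toy_freelist and toy_free are hypothetical; the kernel applies its check when writing the free pointer and does not return a status like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical singly linked freelist; not the SLUB data structure. */
struct toy_freelist {
	void *head;	/* most recently freed object, or NULL */
};

/*
 * Push an object onto the freelist. If the object is already the list
 * head, storing the next pointer would make it point to itself, so the
 * free is rejected. This only catches back-to-back double frees.
 */
bool toy_free(struct toy_freelist *fl, void *object)
{
	assert(object != NULL);

	if (object == fl->head)
		return false;		/* naive double free detected */

	*(void **)object = fl->head;	/* first word holds the next pointer */
	fl->head = object;
	return true;
}
```

The check is naive by design: freeing A, then B, then A again is not detected, because A is no longer at the head of the list when it is freed the second time.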
|
#
2482ddec |
|
06-Sep-2017 |
Kees Cook <keescook@chromium.org> |
mm: add SLUB free list pointer obfuscation This SLUB free list pointer obfuscation code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. This adds a per-cache random value to SLUB caches that is XORed with their freelist pointer address and value. This adds nearly zero overhead and frustrates the very common heap overflow exploitation method of overwriting freelist pointers. A recent example of the attack is written up here: http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit and there is a section dedicated to the technique in the book "A Guide to Kernel Exploitation: Attacking the Core". This is based on patches by Daniel Micay, and refactored to minimize the use of #ifdef. With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following run times: before: mean 10.11882499999999999995 variance .03320378329145728642 stdev .18221905304181911048 after: mean 10.12654000000000000014 variance .04700556623115577889 stdev .21680767106160192064 The difference gets lost in the noise, but if the above is to be taken literally, using CONFIG_FREELIST_HARDENED is 0.07% slower. Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Daniel Micay <danielmicay@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tycho Andersen <tycho@docker.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
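The XOR scheme described above can be sketched in user space as follows. The function names and the plain uintptr_t interface are assumptions made for illustration; they are not the kernel's helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical encode step: the stored value is the real pointer XORed
 * with a per-cache secret and with the address the pointer is stored at.
 */
uintptr_t toy_encode_freeptr(uintptr_t ptr, uintptr_t secret,
			     uintptr_t ptr_addr)
{
	return ptr ^ secret ^ ptr_addr;
}

/* XOR is its own inverse, so decoding applies the same operation. */
uintptr_t toy_decode_freeptr(uintptr_t stored, uintptr_t secret,
			     uintptr_t ptr_addr)
{
	return stored ^ secret ^ ptr_addr;
}
```

An attacker who overwrites the stored word without knowing the secret ends up with a decoded pointer they do not control, which is the property the commit message relies on.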
|
#
ea37df54 |
|
06-Sep-2017 |
Alexander Potapenko <glider@google.com> |
slub: tidy up initialization ordering - free_kmem_cache_nodes() frees the cache node before nulling out a reference to it - init_kmem_cache_nodes() publishes the cache node before initializing it Neither of these matters at runtime because the cache nodes cannot be looked up by any other thread. But it's neater and more consistent to reorder these. Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f6ba4880 |
|
18-Aug-2017 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: fix per memcg cache leak on css offline To avoid a possible deadlock, sysfs_slab_remove() schedules an asynchronous work to delete sysfs entries corresponding to the kmem cache. To ensure the cache isn't freed before the work function is called, it takes a reference to the cache kobject. The reference is supposed to be released by the work function. However, the work function (sysfs_slab_remove_workfn()) does nothing in case the cache sysfs entry has already been deleted, leaking the kobject and the corresponding cache. This may happen on a per memcg cache destruction, because sysfs entries of a per memcg cache are deleted on memcg offline if the cache is empty (see __kmemcg_cache_deactivate()). The kmemleak report looks like this: unreferenced object 0xffff9f798a79f540 (size 32): comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.554s) hex dump (first 32 bytes): 6b 6d 61 6c 6c 6f 63 2d 31 36 28 31 35 39 39 3a kmalloc-16(1599: 6e 65 77 72 6f 6f 74 29 00 23 6b c0 ff ff ff ff newroot).#k..... backtrace: kmemleak_alloc+0x4a/0xa0 __kmalloc_track_caller+0x148/0x2c0 kvasprintf+0x66/0xd0 kasprintf+0x49/0x70 memcg_create_kmem_cache+0xe6/0x160 memcg_kmem_cache_create_func+0x20/0x110 process_one_work+0x205/0x5d0 worker_thread+0x4e/0x3a0 kthread+0x109/0x140 ret_from_fork+0x2a/0x40 unreferenced object 0xffff9f79b6136840 (size 416): comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.573s) hex dump (first 32 bytes): 40 fb 80 c2 3e 33 00 00 00 00 00 40 00 00 00 00 @...>3.....@.... 00 00 00 00 00 00 00 00 10 00 00 00 10 00 00 00 ................ backtrace: kmemleak_alloc+0x4a/0xa0 kmem_cache_alloc+0x128/0x280 create_cache+0x3b/0x1e0 memcg_create_kmem_cache+0x118/0x160 memcg_kmem_cache_create_func+0x20/0x110 process_one_work+0x205/0x5d0 worker_thread+0x4e/0x3a0 kthread+0x109/0x140 ret_from_fork+0x2a/0x40 Fix the leak by adding the missing call to kobject_put() to sysfs_slab_remove_workfn(). 
Link: http://lkml.kernel.org/r/20170812181134.25027-1-vdavydov.dev@gmail.com Fixes: 3b7b314053d02 ("slub: make sysfs file removal asynchronous") Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reported-by: Andrei Vagin <avagin@gmail.com> Tested-by: Andrei Vagin <avagin@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [4.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
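The reference-count imbalance described in this commit can be modelled with a small user-space sketch. All names here (toy_kobj, toy_get, toy_put, toy_schedule_remove, toy_remove_workfn) are hypothetical stand-ins, and the work item is invoked directly instead of through a workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reference-counted object standing in for a cache kobject. */
struct toy_kobj {
	int refcount;
	bool in_sysfs;	/* sysfs entry still present */
	bool released;	/* set once the last reference is dropped */
};

void toy_get(struct toy_kobj *k)
{
	k->refcount++;
}

void toy_put(struct toy_kobj *k)
{
	if (--k->refcount == 0)
		k->released = true;
}

/* Scheduling the removal pins the object until the work function runs. */
void toy_schedule_remove(struct toy_kobj *k)
{
	toy_get(k);
}

/*
 * Work function: the early exit for an already removed sysfs entry must
 * still drop the reference taken in toy_schedule_remove(), otherwise the
 * object is never released.
 */
void toy_remove_workfn(struct toy_kobj *k)
{
	if (!k->in_sysfs)
		goto out;

	k->in_sysfs = false;	/* stands in for deleting the sysfs entry */
out:
	toy_put(k);
}
```

Returning directly from the early-exit branch instead of jumping to the put would reproduce the leak pattern: the reference count would never reach zero.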
|
#
7779f212 |
|
06-Jul-2017 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: memcontrol: account slab stats per lruvec Josef's redesign of the balancing between slab caches and the page cache requires slab cache statistics at the lruvec level. Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
385386cf |
|
06-Jul-2017 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: vmstat: move slab statistics from zone to node counters Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e6d0e1dc |
|
06-Jul-2017 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: wrap kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL kmem_cache->cpu_partial is only used when CONFIG_SLUB_CPU_PARTIAL is set, so wrapping it in CONFIG_SLUB_CPU_PARTIAL will save some space on 32bit arch. This patch wraps kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL and wraps its sysfs use too. Link: http://lkml.kernel.org/r/20170502144533.10729-4-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a93cf07b |
|
06-Jul-2017 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: wrap cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL cpu_slab's field partial is used when CONFIG_SLUB_CPU_PARTIAL is set, which means we can save a pointer's space on each cpu for every slub item. This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps its sysfs use too. [akpm@linux-foundation.org: avoid strange 80-col tricks] Link: http://lkml.kernel.org/r/20170502144533.10729-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
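A rough sketch of how wrapping a field in a config option saves one pointer per instance. The struct, the TOY_CPU_PARTIAL macro, and the accessor are hypothetical stand-ins for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU slab descriptor; field names are illustrative. */
struct toy_cpu_slab {
	void **freelist;	/* next available object */
	void *page;		/* slab currently being allocated from */
#ifdef TOY_CPU_PARTIAL
	void *partial;		/* only present when the option is enabled */
#endif
};

/*
 * Accessor that keeps callers free of #ifdef: it reads the field when
 * the option is enabled and evaluates to NULL otherwise.
 */
#ifdef TOY_CPU_PARTIAL
#define toy_percpu_partial(c)	((c)->partial)
#else
#define toy_percpu_partial(c)	NULL
#endif
```

With TOY_CPU_PARTIAL undefined, the struct holds two pointers instead of three, and code using the accessor still compiles unchanged.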
|
#
d4ff6d35 |
|
06-Jul-2017 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub: reset cpu_slab's pointer in deactivate_slab() Each time a slab is deactivated, the page and freelist pointer should be reset. This patch just merges these two operations into deactivate_slab(). Link: http://lkml.kernel.org/r/20170507031215.3130-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
66fdbe52 |
|
06-Jul-2017 |
Wei Yang <richard.weiyang@gmail.com> |
mm/slub.c: remove a redundant assignment in ___slab_alloc() When the code comes to this point, there are two cases: 1. cpu_slab is deactivated 2. cpu_slab is empty In both cases, cpu_slab->freelist is NULL at this moment. This patch removes the redundant assignment of cpu_slab->freelist. Link: http://lkml.kernel.org/r/20170507031215.3130-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3b7b3140 |
|
23-Jun-2017 |
Tejun Heo <tj@kernel.org> |
slub: make sysfs file removal asynchronous Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") made slub sysfs file removals synchronous to kmem_cache shutdown. Unfortunately, this created a possible ABBA deadlock between slab_mutex and sysfs draining mechanism triggering the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 4.10.0-test+ #48 Not tainted ------------------------------------------------------- rmmod/1211 is trying to acquire lock: (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40 but task is already holding lock: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x75/0x950 mutex_lock_nested+0x1b/0x20 slab_attr_store+0x75/0xd0 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x13c/0x1c0 __vfs_write+0x28/0x120 vfs_write+0xc8/0x1e0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #0 (s_active#120){++++.+}: __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#120); lock(slab_mutex); lock(s_active#120); *** DEADLOCK *** 2 locks held by rmmod/1211: #0: (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80 #1: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 stack backtrace: CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48 Hardware name: 
Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: print_circular_bug+0x1be/0x210 __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 ? SyS_delete_module+0x5/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 It'd be the cleanest to deal with the issue by removing sysfs files without holding slab_mutex before the rest of shutdown; however, given the current code structure, it is pretty difficult to do so. This patch punts sysfs file removal to a work item. Before commit bf5eb3de3847, the removal was punted to an RCU delayed work item which is executed after release. Now, we're punting to a different work item on shutdown which still maintains the goal of removing the sysfs files earlier when destroying kmem_caches. Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
478fe303 |
|
02-Jun-2017 |
Thomas Gleixner <tglx@linutronix.de> |
slub/memcg: cure the brainless abuse of sysfs attributes memcg_propagate_slab_attrs() abuses the sysfs attribute file functions to propagate settings from the root kmem_cache to a newly created kmem_cache. It does that with: attr->show(root, buf); attr->store(new, buf, strlen(buf)); Aside from being lazy and absurd hackery this is broken because it does not check the return value of the show() function. Some of the show() functions return 0 w/o touching the buffer. That means in such a case the store function is called with the stale content of the previous show(). That causes nonsense like invoking kmem_cache_shrink() on a newly created kmem_cache. In the worst case it would cause handing in an uninitialized buffer. This should be rewritten properly by adding a propagate() callback to those slub_attributes which must be propagated and avoid that insane conversion to and from ASCII, but that's too large for a hot fix. Check at least the return value of the show() function, so calling store() with stale content is prevented. Steven said: "It can cause a deadlock with get_online_cpus() that has been uncovered by recent cpu hotplug and lockdep changes that Thomas and Peter have been doing.
Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(slab_mutex); lock(cpu_hotplug.lock); lock(slab_mutex); *** DEADLOCK ***" Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
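The hot fix described in this commit, checking the return value of show() before calling store(), can be sketched in user space roughly as follows. The toy_cache and toy_attr types and all function names are hypothetical; the real sysfs callbacks have different signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cache with a single tunable and a store counter. */
struct toy_cache {
	long value;
	int stores;
};

/* Hypothetical attribute with text-based show/store callbacks. */
struct toy_attr {
	long (*show)(const struct toy_cache *c, char *buf, size_t size);
	long (*store)(struct toy_cache *c, const char *buf, size_t len);
};

/* Formats the value into buf and returns the number of bytes written. */
long toy_show_value(const struct toy_cache *c, char *buf, size_t size)
{
	return snprintf(buf, size, "%ld\n", c->value);
}

/* Models a show() that returns 0 without touching the buffer. */
long toy_show_nothing(const struct toy_cache *c, char *buf, size_t size)
{
	(void)c;
	(void)buf;
	(void)size;
	return 0;
}

long toy_store_value(struct toy_cache *c, const char *buf, size_t len)
{
	c->value = strtol(buf, NULL, 10);
	c->stores++;
	return (long)len;
}

/* Propagate one attribute; skip store() when show() produced nothing. */
void toy_propagate(const struct toy_attr *attr, const struct toy_cache *root,
		   struct toy_cache *child)
{
	char buf[64];
	long len = attr->show(root, buf, sizeof(buf));

	if (len > 0)
		attr->store(child, buf, (size_t)len);
}
```

Without the length check, the second kind of attribute would hand store() whatever an earlier show() left in the buffer, or uninitialized memory, which is the failure mode the commit message describes.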
|
#
5f0d5a3a |
|
18-Jan-2017 |
Paul E. McKenney <paulmck@kernel.org> |
mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU A group of Linux kernel hackers reported chasing a bug that resulted from their assumption that SLAB_DESTROY_BY_RCU provided an existence guarantee, that is, that no block from such a slab would be reallocated during an RCU read-side critical section. Of course, that is not the case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire slab of blocks. However, there is a phrase for this, namely "type safety". This commit therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order to avoid future instances of this sort of confusion. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> [ paulmck: Add comments mentioning the old name, as requested by Eric Dumazet, in order to help people familiar with the old name find the new one. ] Acked-by: David Rientjes <rientjes@google.com>
|
#
1663f26d |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slub: make sysfs directories for memcg sub-caches optional SLUB creates a per-cache directory under /sys/kernel/slab which hosts a bunch of debug files. Usually, there aren't that many caches on a system and this doesn't really matter; however, if memcg is in use, each cache can have per-cgroup sub-caches. SLUB creates the same directories for these sub-caches under /sys/kernel/slab/$CACHE/cgroup. Unfortunately, because there can be a lot of cgroups, active or draining, the product of the numbers of caches, cgroups and files in each directory can reach a very high number - hundreds of thousands is commonplace. Millions and beyond aren't difficult to reach either. What's under /sys/kernel/slab is primarily for debugging and the information and control on a root cache already cover its sub-caches. While having a separate directory for each sub-cache can be helpful for development, it doesn't make much sense to pay this amount of overhead by default. This patch introduces a boot parameter slub_memcg_sysfs which determines whether to create sysfs directories for per-memcg sub-caches. It also adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's default value and defaults to 0. [akpm@linux-foundation.org: kset_unregister(NULL) is legal] Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
50862ce7 |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slab: remove slub sysfs interface files early for empty memcg caches With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. Each cache has a number of sysfs interface files under /sys/kernel/slab. On a system with a lot of memory and transient memcgs, the number of interface files which have to be removed once memory reclaim kicks in can reach millions. Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
01fb58bc |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slab: remove synchronous synchronize_sched() from memcg cache deactivation path With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slub uses synchronize_sched() to deactivate a memcg cache. synchronize_sched() is an expensive and slow operation and doesn't scale when a huge number of caches are destroyed back-to-back. While there used to be a simple batching mechanism, the batching was too restricted to be helpful. This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub can use to schedule sched RCU callback instead of performing synchronize_sched() synchronously while holding cgroup_mutex. While this adds online cpus, mems and slab_mutex operations, operating on these locks back-to-back from the same kworker, which is what's gonna happen when there are many to deactivate, isn't expensive at all and this gets rid of the scalability problem completely. Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c9fc5864 |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slab: introduce __kmemcg_cache_deactivate() __kmem_cache_shrink() is called with %true @deactivate only for memcg caches. Remove @deactivate from __kmem_cache_shrink() and introduce __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator should implement it and it should deactivate and drain the cache. This is to allow memcg cache deactivation behavior to further deviate from simple shrinking without messing up __kmem_cache_shrink(). This is pure reorganization and doesn't introduce any observable behavior changes. v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir. Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
510ded33 |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slab: implement slab_root_caches list With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slab_caches currently lists all caches including root and memcg ones. This is the only data structure which lists the root caches and iterating root caches can only be done by walking the list while skipping over memcg caches. As there can be a huge number of memcg caches, this can become very expensive. This also can make /proc/slabinfo behave very badly. seq_file processes reads in 4k chunks and seeks to the previous Nth position on slab_caches list to resume after each chunk. With a lot of memcg cache churns on the list, reading /proc/slabinfo can become very slow and its content often ends up with duplicate and/or missing entries. This patch adds a new list slab_root_caches which lists only the root caches. When memcg is not enabled, it becomes just an alias of slab_caches. memcg specific list operations are collected into memcg_[un]link_cache(). Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bf5eb3de |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
slub: separate out sysfs_slab_release() from sysfs_slab_remove() Separate out slub sysfs removal and release, and call the former earlier from __kmem_cache_shutdown(). There's no reason to defer sysfs removal through RCU and this will later allow us to remove sysfs files way earlier during memory cgroup offline instead of release. Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
290b6a58 |
|
22-Feb-2017 |
Tejun Heo <tj@kernel.org> |
Revert "slub: move synchronize_sched out of slab_mutex on shrink" Patch series "slab: make memcg slab destruction scalable", v3. With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. I've seen machines which end up with hundreds of thousands of caches and many millions of kernfs_nodes. The current code is O(N^2) on the total number of caches and has synchronous rcu_barrier() and synchronize_sched() in cgroup offline / release path which is executed while holding cgroup_mutex. Combined, this leads to very expensive and slow cache destruction operations which can easily keep running for half a day. This also messes up /proc/slabinfo along with other cache iterating operations. seq_file operates on 4k chunks and on each 4k boundary tries to seek to the last position in the list. With a huge number of caches on the list, this becomes very slow and very prone to the list content changing underneath it leading to a lot of missing and/or duplicate entries. This patchset addresses the scalability problem. * Add root and per-memcg lists. Update each user to use the appropriate list. * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched and asynchronous. * For dying empty slub caches, remove the sysfs files after deactivation so that we don't end up with millions of sysfs files without any useful information on them. This patchset contains the following ten patches. 
0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch 0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch 0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch 0004-slab-reorganize-memcg_cache_params.patch 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch 0006-slab-implement-slab_root_caches-list.patch 0007-slab-introduce-__kmemcg_cache_deactivate.patch 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch 0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch 0001 reverts an existing optimization to prepare for the following changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release path batched and asynchronous. 0004-0006 separate out the lists. 0007-0008 replace synchronize_sched() in slub destruction path with call_rcu_sched(). 0009 removes sysfs files early for empty dying caches. 0010 makes destruction work items use a workqueue with limited concurrency. This patch (of 10): Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on shrink"). With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. Moving synchronize_sched() out of slab_mutex isn't enough as it's still inside cgroup_mutex. The whole deactivation / release path will be updated to avoid all synchronous RCU operations. Revert this insufficient optimization in preparation to ease future changes. 
Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
65b9de75 |
|
22-Feb-2017 |
Borislav Petkov <bp@suse.de> |
mm/slub: add a dump_stack() to the unexpected GFP check We wish to know who is doing such a thing. slab.c does this. Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a810007a |
|
08-Feb-2017 |
Sean Rees <sean@erifax.org> |
mm/slub.c: fix random_seq offset destruction Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub initialisation as described in https://bugzilla.kernel.org/show_bug.cgi?id=177551. Bail out early from init_cache_random_seq if s->random_seq is already initialised. This prevents destroying the previously computed random_seq offsets later in the function. If the offsets are destroyed, then shuffle_freelist will truncate page->freelist to just the first object (orphaning the rest). Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization") Link: http://lkml.kernel.org/r/20170207140707.20824-1-sean@erifax.org Signed-off-by: Sean Rees <sean@erifax.org> Reported-by: <userwithuid@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Garnier <thgarnie@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
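The early bail-out described in this fix can be sketched in userspace C. This is a minimal illustration and not the kernel code: `struct demo_cache` and `demo_init_random_seq()` are hypothetical stand-ins, and the sequence is left in identity order (no shuffling) so the effect of the guard is easy to see. The assumption is that the index-to-offset conversion happens in place, so running it a second time would corrupt the offsets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the few kmem_cache fields that matter here. */
struct demo_cache {
    unsigned int *random_seq; /* NULL until the sequence has been built */
    unsigned int count;       /* objects per slab */
    unsigned int size;        /* object size in bytes */
};

/*
 * Build the per-cache sequence once and convert indices to byte offsets.
 * Returns 0 on success and -1 if the allocation fails.
 */
int demo_init_random_seq(struct demo_cache *s)
{
    unsigned int i;

    assert(s != NULL);

    /* Guard: the sequence already exists, so do not touch the offsets. */
    if (s->random_seq)
        return 0;

    s->random_seq = malloc(s->count * sizeof(*s->random_seq));
    if (!s->random_seq)
        return -1;

    /* Identity order keeps the sketch deterministic. */
    for (i = 0; i < s->count; i++)
        s->random_seq[i] = i;

    /* In-place conversion from object index to byte offset. */
    for (i = 0; i < s->count; i++)
        s->random_seq[i] *= s->size;

    return 0;
}
```

Calling `demo_init_random_seq()` twice on the same cache leaves the offsets from the first call untouched, which is the property the commit restores.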
|
#
aa2efd5e |
|
24-Jan-2017 |
Daniel Thompson <daniel.thompson@linaro.org> |
mm/slub.c: trace free objects at KERN_INFO Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way by adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
84582c8a |
|
12-Dec-2016 |
Arnd Bergmann <arnd@arndb.de> |
slub: avoid false-positive warning The slub allocator gives us some incorrect warnings when CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro prevents it from seeing that the return code matches what it was before: mm/slub.c: In function `kmem_cache_free_bulk': mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized] I have not been able to come up with a perfect way of dealing with this. The three options I see are: - add a bogus initialization, which would increase the runtime overhead - replace unlikely() with unlikely_notrace() - remove the unlikely() annotation completely I checked the object code for a typical x86 configuration and the last two cases produce the same result, so I went for the last one, which is the simplest. Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
89e364db |
|
12-Dec-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: move synchronize_sched out of slab_mutex on shrink synchronize_sched() is a heavy operation and calling it per each cache owned by a memory cgroup being destroyed may take quite some time. What is worse, it's currently called under the slab_mutex, stalling all works doing cache creation/destruction. Actually, there isn't much point in calling synchronize_sched() for each cache - it's enough to call it just once - after setting cpu_partial for all caches and before shrinking them. This way, we can also move it out of the slab_mutex, which we have to hold for iterating over the slab cache list. Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991 Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reported-by: Doug Smythies <dsmythies@telus.net> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
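The batching idea in this message (one grace-period wait for all caches instead of one per cache) can be sketched in userspace C. Everything here is hypothetical: `struct demo_cache`, `demo_synchronize()` and the counter merely stand in for the real caches and for synchronize_sched(), and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced cache descriptor. */
struct demo_cache {
    unsigned int cpu_partial; /* per-cpu partial limit, 0 disables it */
    int shrunk;               /* set once the cache has been shrunk */
};

/* Counts how often the expensive wait was issued. */
static unsigned long demo_waits;

/* Stand-in for the heavy grace-period wait. */
static void demo_synchronize(void)
{
    demo_waits++;
}

unsigned long demo_wait_count(void)
{
    return demo_waits;
}

/* Old scheme: one wait per cache. */
void demo_deactivate_each(struct demo_cache *c, size_t n)
{
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++) {
        c[i].cpu_partial = 0;
        demo_synchronize();
        c[i].shrunk = 1;
    }
}

/* Batched scheme: update every cache, wait once, then shrink them all. */
void demo_deactivate_batched(struct demo_cache *c, size_t n)
{
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++)
        c[i].cpu_partial = 0;

    demo_synchronize();

    for (i = 0; i < n; i++)
        c[i].shrunk = 1;
}
```

With three caches the first variant waits three times and the second waits once; the ordering "set cpu_partial everywhere, wait, then shrink" is the part taken from the commit message.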
|
#
a96a87bf |
|
18-Aug-2016 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
slub: Convert to hotplug state machine Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
60398923 |
|
10-Aug-2016 |
Chris Wilson <chris@chris-wilson.co.uk> |
mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a kmem_cache_node is destroyed the call_rcu() may trigger a slab allocation to fill the debug object pool (__debug_object_init:fill_pool). Everywhere but during kmem_cache_destroy(), discard_slab() is performed outside of the kmem_cache_node->list_lock and avoids a lockdep warning about potential recursion: ============================================= [ INFO: possible recursive locking detected ] 4.8.0-rc1-gfxbench+ #1 Tainted: G U --------------------------------------------- rmmod/8895 is trying to acquire lock: (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430 but task is already holding lock: (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&n->list_lock)->rlock); lock(&(&n->list_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 5 locks held by rmmod/8895: #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0 #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0 #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80 #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220 #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320 stack backtrace: CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1 Hardware name: Gigabyte Technology Co., Ltd. 
H87M-D3H/H87M-D3H, BIOS F11 08/18/2015 Call Trace: __lock_acquire+0x1646/0x1ad0 lock_acquire+0xb2/0x200 _raw_spin_lock+0x36/0x50 get_partial_node.isra.63+0x47/0x430 ___slab_alloc.constprop.67+0x1a7/0x3b0 __slab_alloc.isra.64.constprop.66+0x43/0x80 kmem_cache_alloc+0x236/0x2d0 __debug_object_init+0x2de/0x400 debug_object_activate+0x109/0x1e0 __call_rcu.constprop.63+0x32/0x2f0 call_rcu+0x12/0x20 discard_slab+0x3d/0x40 __kmem_cache_shutdown+0xdb/0x320 shutdown_cache+0x19/0x60 kmem_cache_destroy+0x1ae/0x220 i915_gem_load_cleanup+0x14/0x40 [i915] i915_driver_unload+0x151/0x180 [i915] i915_pci_remove+0x14/0x20 [i915] pci_device_remove+0x34/0xb0 __device_release_driver+0x95/0x140 driver_detach+0xb6/0xc0 bus_remove_driver+0x53/0xd0 driver_unregister+0x27/0x50 pci_unregister_driver+0x25/0x70 i915_exit+0x1a/0x1e2 [i915] SyS_delete_module+0x193/0x1f0 entry_SYSCALL_64_fastpath+0x1c/0xac Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file") Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk Reported-by: Dave Gordon <david.s.gordon@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
117d54df |
|
04-Aug-2016 |
Geert Uytterhoeven <geert@linux-m68k.org> |
slub: drop bogus inline for fixup_red_left() With m68k-linux-gnu-gcc-4.1: include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here Commit c146a2b98eb5 ("mm, kasan: account for object redzone in SLUB's nearest_obj()") made fixup_red_left() global, but forgot to remove the inline keyword. Fixes: c146a2b98eb5898e ("mm, kasan: account for object redzone in SLUB's nearest_obj()") Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Alexander Potapenko <glider@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b3cbd9bf |
|
02-Aug-2016 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm/kasan: get rid of ->state in struct kasan_alloc_meta The state of an object is currently tracked in two places: shadow memory, and the ->state field in struct kasan_alloc_meta. We can get rid of the latter. This will save us a little bit of memory. Also, this allows us to move the free stack into struct kasan_alloc_meta without increasing memory consumption. So now we should always know when the object was last freed. This may be useful for long delayed use-after-free bugs. As a side effect this fixes the following UBSAN warning: UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13 member access within misaligned address ffff88000d1efebc for type 'struct qlist_node' which requires 8 byte alignment Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
80a9201a |
|
28-Jul-2016 |
Alexander Potapenko <glider@google.com> |
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB For KASAN builds: - switch SLUB allocator to using stackdepot instead of storing the allocation/deallocation stacks in the objects; - change the freelist hook so that parts of the freelist can be put into the quarantine. [aryabinin@virtuozzo.com: fixes] Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c146a2b9 |
|
28-Jul-2016 |
Alexander Potapenko <glider@google.com> |
mm, kasan: account for object redzone in SLUB's nearest_obj() When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Previously, when KASAN had detected an error on an object from a cache with SLAB_RED_ZONE set, the actual start address of the object was miscalculated, which led to random stacks having been reported. Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support") Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
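The offset calculation this commit corrects can be illustrated with a small userspace sketch. It works on byte offsets from the start of a slab rather than on real addresses. `struct demo_layout` and `demo_nearest_obj_offset()` are hypothetical names, and the layout assumption (each slot begins with a left red zone of `red_left_pad` bytes, followed by the object) is a simplification and may not match the kernel's slab layout exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one slab's layout. */
struct demo_layout {
    size_t size;         /* stride between slots, metadata included */
    size_t red_left_pad; /* left red zone bytes, 0 if red zoning is off */
    size_t objects;      /* number of slots in the slab */
};

/*
 * Map an arbitrary offset inside the slab to the offset at which the
 * nearest object's payload starts.
 */
size_t demo_nearest_obj_offset(const struct demo_layout *s, size_t off)
{
    size_t slot, last;

    assert(s != NULL && s->size > 0 && s->objects > 0);

    /* Round down to the start of the containing slot. */
    slot = off - off % s->size;

    /* Clamp offsets in the slab's tail padding to the last slot. */
    last = (s->objects - 1) * s->size;
    if (slot > last)
        slot = last;

    /* Skip the left red zone; without this step the result is off by the pad. */
    return slot + s->red_left_pad;
}
```

For a layout with 64-byte slots and a 16-byte left red zone, offset 70 falls into the second slot and maps to payload offset 80, not 64.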
|
#
4949148a |
|
26-Jul-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
mm: charge/uncharge kmemcg from generic page allocator paths Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. 
Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
72baeef0c |
|
26-Jul-2016 |
Michal Hocko <mhocko@suse.com> |
slab: do not panic on invalid gfp_mask Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask. This is a rather harsh way to announce a non-critical issue. Allocator is free to ignore invalid flags. Let's simply replace BUG() by dump_stack to tell the offender and fixup the mask to move on with the allocation request. This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test module: Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code! CPU: 0 PID: 2916 Comm: insmod Tainted: G O 4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 Call Trace: dump_stack+0x67/0x90 cache_alloc_refill+0x201/0x617 kmem_cache_alloc_trace+0xa7/0x24a ? 0xffffffffa0005000 mymodule_init+0x20/0x1000 [test_slab] do_one_initcall+0xe7/0x16c ? rcu_read_lock_sched_held+0x61/0x69 ? kmem_cache_alloc_trace+0x197/0x24a do_init_module+0x5f/0x1d9 load_module+0x1a3d/0x1f21 ? retint_kernel+0x2d/0x2d SyS_init_module+0xe8/0x10e ? SyS_init_module+0xe8/0x10e do_syscall_64+0x68/0x13f entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
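The "warn and fix up the mask" behaviour described here can be sketched as follows. The flag values and names (`DEMO_GFP_*`, `demo_fixup_gfp()`) are made up for illustration and are not the kernel's definitions; the sketch prints to stderr where the kernel would warn and dump the stack.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned int demo_gfp_t;

/* Illustrative flag values only. */
#define DEMO_GFP_HIGHMEM  0x02u
#define DEMO_GFP_IO       0x40u
#define DEMO_GFP_FS       0x80u
/* Flags that this hypothetical allocator cannot honour. */
#define DEMO_GFP_BUG_MASK DEMO_GFP_HIGHMEM

/*
 * Strip flags the allocator does not accept and report them,
 * then let the allocation proceed with the cleaned-up mask.
 */
demo_gfp_t demo_fixup_gfp(demo_gfp_t flags)
{
    demo_gfp_t invalid = flags & DEMO_GFP_BUG_MASK;

    if (invalid) {
        flags &= ~DEMO_GFP_BUG_MASK;
        fprintf(stderr,
                "Unexpected gfp: %#x. Fixing up to gfp: %#x. Fix your code!\n",
                invalid, flags);
    }

    /* Postcondition: no rejected flag survives. */
    assert((flags & DEMO_GFP_BUG_MASK) == 0);
    return flags;
}
```

A caller passing `DEMO_GFP_IO | DEMO_GFP_FS | DEMO_GFP_HIGHMEM` gets back `DEMO_GFP_IO | DEMO_GFP_FS` plus a warning; a valid mask passes through silently.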
|
#
bacdcb34 |
|
26-Jul-2016 |
Michal Hocko <mhocko@suse.com> |
slab: make GFP_SLAB_BUG_MASK information more human readable printk offers %pGg for quite some time so let's use it to get a human readable list of invalid flags. The original output would be [ 429.191962] gfp: 2 after the change [ 429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM) Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
210e7a43 |
|
26-Jul-2016 |
Thomas Garnier <thgarnie@google.com> |
mm: SLUB freelist randomization Implements freelist randomization for the SLUB allocator. It was previously implemented for the SLAB allocator. Both use the same configuration option (CONFIG_SLAB_FREELIST_RANDOM). The list is randomized during initialization of a new set of pages. The order on different freelist sizes is pre-computed at boot for performance. Each kmem_cache has its own randomized freelist. This security feature reduces the predictability of the kernel SLUB allocator against heap overflows rendering attacks much less stable. For example these attacks exploit the predictability of the heap: - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU) - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95) Performance results: slab_test impact is between 3% and 4% on average for 100000 attempts without smp. This is very focused testing; kernbench shows the overall impact on the system is much lower. Before: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles 100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles 100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles 100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles 100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles 100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles 100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles 100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles 100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles 100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles 2. 
Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 70 cycles 100000 times kmalloc(16)/kfree -> 70 cycles 100000 times kmalloc(32)/kfree -> 70 cycles 100000 times kmalloc(64)/kfree -> 70 cycles 100000 times kmalloc(128)/kfree -> 70 cycles 100000 times kmalloc(256)/kfree -> 69 cycles 100000 times kmalloc(512)/kfree -> 70 cycles 100000 times kmalloc(1024)/kfree -> 73 cycles 100000 times kmalloc(2048)/kfree -> 72 cycles 100000 times kmalloc(4096)/kfree -> 71 cycles After: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles 100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles 100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles 100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles 100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles 100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles 100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles 100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles 100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles 100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles 2. 
Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 66 cycles 100000 times kmalloc(16)/kfree -> 66 cycles 100000 times kmalloc(32)/kfree -> 66 cycles 100000 times kmalloc(64)/kfree -> 66 cycles 100000 times kmalloc(128)/kfree -> 65 cycles 100000 times kmalloc(256)/kfree -> 67 cycles 100000 times kmalloc(512)/kfree -> 67 cycles 100000 times kmalloc(1024)/kfree -> 64 cycles 100000 times kmalloc(2048)/kfree -> 67 cycles 100000 times kmalloc(4096)/kfree -> 67 cycles Kernbench, before: Average Optimal load -j 12 Run (std deviation): Elapsed Time 101.873 (1.16069) User Time 1045.22 (1.60447) System Time 88.969 (0.559195) Percent CPU 1112.9 (13.8279) Context Switches 189140 (2282.15) Sleeps 99008.6 (768.091) After: Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.47 (0.562732) User Time 1045.3 (1.34263) System Time 88.311 (0.342554) Percent CPU 1105.8 (6.49444) Context Switches 189081 (2355.78) Sleeps 99231.5 (800.358) Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com Signed-off-by: Thomas Garnier <thgarnie@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
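The idea of pre-computing a randomized object order can be sketched with a Fisher-Yates shuffle in userspace C. This is an illustration under stated assumptions and not the kernel implementation: `demo_rand()` is a toy linear congruential generator chosen for reproducibility (it is not suitable for security purposes), and `demo_precompute_seq()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Toy LCG, deterministic for a given seed. Not cryptographically safe. */
static unsigned int demo_rand(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/*
 * Fill seq[0..count-1] with a pseudo-random permutation of the object
 * indices 0..count-1. Computed once; reused for every new slab.
 */
void demo_precompute_seq(unsigned int *seq, size_t count, unsigned int seed)
{
    size_t i, j;
    unsigned int tmp;

    assert(seq != NULL || count == 0);

    for (i = 0; i < count; i++)
        seq[i] = (unsigned int)i;

    /* Fisher-Yates: swap each position with a random earlier-or-equal one. */
    for (i = count; i > 1; i--) {
        j = demo_rand(&seed) % i; /* slight modulo bias, fine for a sketch */
        tmp = seq[i - 1];
        seq[i - 1] = seq[j];
        seq[j] = tmp;
    }
}
```

Computing the permutation once and reusing it for each new slab trades some randomness for speed, which matches the commit's remark that the order is pre-computed at boot for performance.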
|
#
ed18adc1 |
|
23-Jun-2016 |
Kees Cook <keescook@chromium.org> |
mm: SLUB hardened usercopy support Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the SLUB allocator to catch any copies that may span objects. Includes a redzone handling fix discovered by Michael Ellerman. Based on code from PaX and grsecurity. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Laura Abbott <labbott@redhat.com>
|
#
4ebb31a4 |
|
20-May-2016 |
Alexander Potapenko <glider@google.com> |
mm, kasan: don't call kasan_krealloc() from ksize(). Instead of calling kasan_krealloc(), which replaces the memory allocation stack ID (if stack depot is used), just unpoison the whole memory chunk. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0139aa7b |
|
19-May-2016 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm: rename _count, field of the struct page, to _refcount Many developers already know that the reference count field of struct page is _count and that it is an atomic type. They may try to handle it directly, and this could defeat the purpose of the page reference count tracepoint. To prevent direct _count modification, this patch renames it to _refcount and adds a warning message in the code. After that, developers who need to handle the reference count will find that the field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
43efd3ea |
|
19-May-2016 |
Li Peng <lip@dtdream.com> |
mm/slub.c: fix sysfs filename in comment /sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio. Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com Signed-off-by: Li Peng <lip@dtdream.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
81ae6d03 |
|
19-May-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink() When we call __kmem_cache_shrink on memory cgroup removal, we need to synchronize kmem_cache->cpu_partial update with put_cpu_partial that might be running on other cpus. Currently, we achieve that by using kick_all_cpus_sync, which works as a system wide memory barrier. Fast as it is, this method has a flaw - it issues a lot of IPIs, which might hurt high performance or real-time workloads. To fix this, let's replace kick_all_cpus_sync with synchronize_sched. Although the latter may take much longer to finish, it shouldn't be a problem in this particular case, because memory cgroups are destroyed asynchronously from a workqueue so that no user visible effects should be introduced. OTOH, it will save us from excessive IPIs when someone removes a cgroup. Anyway, even if using synchronize_sched turns out to take too long, we can always introduce a kind of __kmem_cache_shrink batching so that this method would only be called once per cgroup destruction (not once per memcg kmem cache as it is now). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
505f5dcb |
|
25-Mar-2016 |
Alexander Potapenko <glider@google.com> |
mm, kasan: add GFP flags to KASAN API Add GFP flags to KASAN hooks for future patches to use. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
756a025f |
|
17-Mar-2016 |
Joe Perches <joe@perches.com> |
mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
444eb2a4 |
|
17-Mar-2016 |
Mel Gorman <mgorman@techsingularity.net> |
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option THP defrag is enabled by default to direct reclaim/compact but not wake kswapd in the event of a THP allocation failure. The problem is that THP allocation requests potentially enter reclaim/compaction. This potentially incurs a severe stall that is not guaranteed to be offset by reduced TLB misses. While there has been considerable effort to reduce the impact of reclaim/compaction, it is still a high cost and workloads that should fit in memory fail to do so. Specifically, a simple anon/file streaming workload will enter direct reclaim on NUMA at least even though the working set size is 80% of RAM. It's been years and it's time to throw in the towel. First, this patch defines THP defrag as follows; madvise: A failed allocation will direct reclaim/compact if the application requests it never: Neither reclaim/compact nor wake kswapd defer: A failed allocation will wake kswapd/kcompactd always: A failed allocation will direct reclaim/compact (historical behaviour) khugepaged defrag will enter direct/reclaim but not wake kswapd. Next it sets the default defrag option to be "madvise" to only enter direct reclaim/compaction for applications that specifically requested it. Lastly, it removes a check from the page allocator slowpath that is related to __GFP_THISNODE to allow "defer" to work. The callers that really cares are slub/slab and they are updated accordingly. The slab one may be surprising because it also corrects a comment as kswapd was never woken up by that path. This means that a THP fault will no longer stall for most applications by default and the ideal for most users that get THP if they are immediately available. There are still options for users that prefer a stall at startup of a new application by either restoring historical behaviour with "always" or pick a half-way point with "defer" where kswapd does some of the work in the background and wakes kcompactd if necessary. 
THP defrag for khugepaged remains enabled and will enter direct/reclaim but no wakeup kswapd or kcompactd. After this patch a THP allocation failure will quickly fallback and rely on khugepaged to recover the situation at some time in the future. In some cases, this will reduce THP usage but the benefit of THP is hard to measure and not a universal win where as a stall to reclaim/compaction is definitely measurable and can be painful. The first test for this is using "usemem" to read a large file and write a large anonymous mapping (to avoid the zero page) multiple times. The total size of the mappings is 80% of RAM and the benchmark simply measures how long it takes to complete. It uses multiple threads to see if that is a factor. On UMA, the performance is almost identical so is not reported but on NUMA, we see this usemem 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%) Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%) Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%) Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%) Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%) Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%) Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%) Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%) Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%) Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%) Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%) Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%) Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%) Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%) For a single thread, the benchmark completes 43.23% faster with this patch applied with smaller benefits as the thread increases. Similar, notice the large reduction in most cases in system CPU usage. The overall CPU time is 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 User 10357.65 10438.33 System 3988.88 3543.94 Elapsed 2203.01 1634.41 Which is substantial. 
Now, the reclaim figures 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 128458477 278352931 Major Faults 2174976 225 Swap Ins 16904701 0 Swap Outs 17359627 0 Allocation stalls 43611 0 DMA allocs 0 0 DMA32 allocs 19832646 19448017 Normal allocs 614488453 580941839 Movable allocs 0 0 Direct pages scanned 24163800 0 Kswapd pages scanned 0 0 Kswapd pages reclaimed 0 0 Direct pages reclaimed 20691346 0 Compaction stalls 42263 0 Compaction success 938 0 Compaction failures 41325 0 This patch eliminates almost all swapping and direct reclaim activity. There is still overhead but it's from NUMA balancing which does not identify that it's pointless trying to do anything with this workload. I also tried the thpscale benchmark which forces a corner case where compaction can be used heavily and measures the latency of whether base or huge pages were used thpscale Fault Latencies 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%) Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%) Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%) Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%) Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%) Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%) Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%) Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%) Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%) Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%) Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%) Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%) Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%) Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%) Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%) Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%) Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%) Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%) The average time to fault pages is substantially reduced in the majority of 
cases but with the obvious caveat that fewer THPs are actually used in this adverse workload 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%) Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%) Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%) Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%) Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%) Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%) Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%) Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%) Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%) 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 37429143 47564000 Major Faults 1916 1558 Swap Ins 1466 1079 Swap Outs 2936863 149626 Allocation stalls 62510 3 DMA allocs 0 0 DMA32 allocs 6566458 6401314 Normal allocs 216361697 216538171 Movable allocs 0 0 Direct pages scanned 25977580 17998 Kswapd pages scanned 0 3638931 Kswapd pages reclaimed 0 207236 Direct pages reclaimed 8833714 88 Compaction stalls 103349 5 Compaction success 270 4 Compaction failures 103079 1 Note again that while this does swap as it's an aggressive workload, the direct reclaim activity and allocation stalls are substantially reduced. There is some kswapd activity but ftrace showed that the kswapd activity was due to normal wakeups from 4K pages being allocated. Compaction-related stalls and activity are almost eliminated. I also tried the stutter benchmark. 
For this, I do not have figures for NUMA but it's something that does impact UMA so I'll report what is available stutter 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%) 1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%) 2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%) 3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%) Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%) Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%) Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%) Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%) Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%) Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%) This benchmark is trying to fault an anonymous mapping while there is a heavy IO load -- a scenario that desktop users used to complain about frequently. This shows a mix because the ideal case of mapping with THP is not hit as often. However, note that 99% of the mappings complete 13.79% faster. The CPU usage here is particularly interesting 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 User 67.50 0.99 System 1327.88 91.30 Elapsed 2079.00 2128.98 And once again we look at the reclaim figures 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 335241922 1314582827 Major Faults 715 819 Swap Ins 0 0 Swap Outs 0 0 Allocation stalls 532723 0 DMA allocs 0 0 DMA32 allocs 1822364341 1177950222 Normal allocs 1815640808 1517844854 Movable allocs 0 0 Direct pages scanned 21892772 0 Kswapd pages scanned 20015890 41879484 Kswapd pages reclaimed 19961986 41822072 Direct pages reclaimed 21892741 0 Compaction stalls 1065755 0 Compaction success 514 0 Compaction failures 1065241 0 Allocation stalls and all direct reclaim activity is eliminated as well as compaction-related stalls. THP gives impressive gains in some cases but only if they are quickly available. We're not going to reach the point where they are completely free so lets take the costs out of the fast paths finally and defer the cost to kswapd, kcompactd and khugepaged where it belongs. 
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
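The four defrag settings defined in this commit message can be read as a small decision table. The following is a minimal userspace sketch of that table, not the kernel's implementation: the `model_*` names and the action struct are hypothetical, and the behaviour of "madvise" for an application that did not request THP is an assumption (treated here like "never"), since the message only describes the requested case.

```c
#include <assert.h>
#include <stdbool.h>

/* The four defrag settings named in the commit message. */
enum model_thp_defrag {
	MODEL_DEFRAG_ALWAYS,
	MODEL_DEFRAG_DEFER,
	MODEL_DEFRAG_MADVISE,
	MODEL_DEFRAG_NEVER,
};

/* Hypothetical summary of what a failed THP allocation may do. */
struct model_thp_action {
	bool direct_reclaim;  /* faulting task enters direct reclaim/compaction */
	bool wake_background; /* kswapd/kcompactd are woken instead of stalling */
};

static struct model_thp_action
model_thp_on_failure(enum model_thp_defrag mode, bool app_madvised)
{
	struct model_thp_action act = { false, false };

	assert(mode <= MODEL_DEFRAG_NEVER);

	switch (mode) {
	case MODEL_DEFRAG_ALWAYS:
		/* Historical behaviour: stall in reclaim/compaction. */
		act.direct_reclaim = true;
		break;
	case MODEL_DEFRAG_DEFER:
		/* Fall back quickly, let background threads do the work. */
		act.wake_background = true;
		break;
	case MODEL_DEFRAG_MADVISE:
		/* Assumption: without the madvise hint this acts like "never". */
		act.direct_reclaim = app_madvised;
		break;
	case MODEL_DEFRAG_NEVER:
		break;
	}
	return act;
}
```

With the new default of "madvise", `model_thp_on_failure(MODEL_DEFRAG_MADVISE, false)` reports neither a stall nor a wakeup, which is the stall-free default case the message argues for.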
|
#
922d566c |
|
17-Mar-2016 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm/slub: query dynamic DEBUG_PAGEALLOC setting We can disable debug_pagealloc processing even if the code is compiled with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query whether it is enabled or not in runtime. [akpm@linux-foundation.org: clean up code, per Christian] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
27ee57c9 |
|
17-Mar-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
mm: memcontrol: report slab usage in cgroup2 memory.stat Show how much memory is used for storing reclaimable and unreclaimable in-kernel data structures allocated from slab caches. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5b3810e5 |
|
15-Mar-2016 |
Vlastimil Babka <vbabka@suse.cz> |
mm, sl[au]b: print gfp_flags as strings in slab_out_of_memory() We can now print gfp_flags in a more human-readable form. Make use of this in slab_out_of_memory() for SLUB and SLAB. Also convert the SLAB variant to pr_warn() along the way. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d86bd1be |
|
15-Mar-2016 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm/slub: support left redzone SLUB already has a redzone debugging feature. But it is only positioned at the end of object (aka right redzone) so it cannot catch left oob. Although current object's right redzone acts as left redzone of next object, first object in a slab cannot take advantage of this effect. This patch explicitly adds a left red zone to each object to detect left oob more precisely. Background: Someone complained to me that left OOB doesn't catch even if KASAN is enabled which does page allocation debugging. That page is out of our control so it would be allocated when left OOB happens and, in this case, we can't find OOB. Moreover, SLUB debugging feature can be enabled without page allocator debugging and, in this case, we will miss that OOB. Before trying to implement, I expected that changes would be too complex, but, it doesn't look that complex to me now. Almost changes are applied to debug specific functions so I feel okay. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
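The idea of a per-object left redzone can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the slot layout, the sizes and the `slot_*` helpers are hypothetical, and the pattern byte is illustrative rather than taken from the SLUB source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constants for this model; real SLUB sizes depend on the cache. */
#define RED_BYTE   0xbb /* illustrative redzone pattern */
#define REDZONE_SZ 8
#define OBJ_SZ     32

/* One slot: [left redzone][object payload][right redzone]. */
struct slot {
	unsigned char left[REDZONE_SZ];
	unsigned char obj[OBJ_SZ];
	unsigned char right[REDZONE_SZ];
};

/* Fill both redzones with the pattern and clear the payload. */
static void slot_init(struct slot *s)
{
	assert(s != NULL);
	memset(s->left, RED_BYTE, REDZONE_SZ);
	memset(s->obj, 0, OBJ_SZ);
	memset(s->right, RED_BYTE, REDZONE_SZ);
}

/* Return 1 if every byte of the zone still holds the pattern. */
static int zone_intact(const unsigned char *zone)
{
	for (size_t i = 0; i < REDZONE_SZ; i++)
		if (zone[i] != RED_BYTE)
			return 0;
	return 1;
}

/*
 * Return 0 if both redzones are intact, -1 for a left out-of-bounds
 * write, 1 for a right out-of-bounds write. The left zone is checked first.
 */
static int slot_check(const struct slot *s)
{
	assert(s != NULL);
	if (!zone_intact(s->left))
		return -1;
	if (!zone_intact(s->right))
		return 1;
	return 0;
}
```

A write to the byte just before `obj` (modelled as `s.left[REDZONE_SZ - 1]`) makes `slot_check()` return -1 even for the first slot in a slab, which has no preceding object whose right redzone could catch it.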
|
#
149daaf3 |
|
15-Mar-2016 |
Laura Abbott <labbott@fedoraproject.org> |
slub: relax CMPXCHG consistency restrictions When debug options are enabled, cmpxchg on the page is disabled. This is because the page must be locked to ensure there are no false positives when performing consistency checks. Some debug options such as poisoning and red zoning only act on the object itself. There is no need to protect other CPUs from modification on only the object. Allow cmpxchg to happen when only poisoning and red zoning are set on a slab. Credit to Mathias Krause for the original work which inspired this series Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
becfda68 |
|
15-Mar-2016 |
Laura Abbott <labbott@fedoraproject.org> |
slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned on or off. Expand its use to be able to turn off all consistency checks. This gives a nice speed up if you only want features such as poisoning or tracing. Credit to Mathias Krause for the original work which inspired this series Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
804aa132 |
|
15-Mar-2016 |
Laura Abbott <labbott@fedoraproject.org> |
slub: fix/clean free_debug_processing return paths Since commit 19c7ff9ecd89 ("slub: Take node lock during object free checks") check_object has been incorrectly returning success as it follows the out label which just returns the node. Thanks to refactoring, the out and fail paths are now basically the same. Combine the two into one and just use a single label. Credit to Mathias Krause for the original work which inspired this series Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
282acb43 |
|
15-Mar-2016 |
Laura Abbott <labbott@fedoraproject.org> |
slub: drop lock at the end of free_debug_processing This series takes the suggestion of Christoph Lameter and only focuses on optimizing the slow path where the debug processing runs. The two main optimizations in this series are letting the consistency checks be skipped and relaxing the cmpxchg restrictions when we are not doing consistency checks. With hackbench -g 20 -l 1000 averaged over 100 runs: Before slub_debug=P mean 15.607 variance .086 stdev .294 After slub_debug=P mean 10.836 variance .155 stdev .394 This still isn't as fast as what is in grsecurity unfortunately so there's still work to be done. Profiling ___slab_alloc shows that 25-50% of time is spent in deactivate_slab. I haven't looked too closely to see if this is something that can be optimized. My plan for now is to focus on getting all of this merged (if appropriate) before digging in to another task. This patch (of 4): Currently, free_debug_processing has a comment "Keep node_lock to preserve integrity until the object is actually freed". In actuality, the lock is dropped immediately in __slab_free. Rather than wait until __slab_free and potentially throw off the unlikely marking, just drop the lock in free_debug_processing. This also lets free_debug_processing take its own copy of the spinlock flags rather than trying to share the ones from __slab_free. Since there is no use for the node afterwards, change the return type of free_debug_processing to return an int like alloc_debug_processing. 
Credit to Mathias Krause for the original work which inspired this series [akpm@linux-foundation.org: fix build] Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ca257195 |
|
15-Mar-2016 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mm: new API kfree_bulk() for SLAB+SLUB allocators This patch introduce a new API call kfree_bulk() for bulk freeing memory objects not bound to a single kmem_cache. Christoph pointed out that it is possible to implement freeing of objects, without knowing the kmem_cache pointer as that information is available from the object's page->slab_cache. Proposing to remove the kmem_cache argument from the bulk free API. Jesper demonstrated that these extra steps per object comes at a performance cost. It is only in the case CONFIG_MEMCG_KMEM is compiled in and activated runtime that these steps are done anyhow. The extra cost is most visible for SLAB allocator, because the SLUB allocator does the page lookup (virt_to_head_page()) anyhow. Thus, the conclusion was to keep the kmem_cache free bulk API with a kmem_cache pointer, but we can still implement a kfree_bulk() API fairly easily. Simply by handling if kmem_cache_free_bulk() gets called with a kmem_cache NULL pointer. This does increase the code size a bit, but implementing a separate kfree_bulk() call would likely increase code size even more. Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y. 
Code size increase for SLAB: add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74) function old new delta kmem_cache_free_bulk 660 734 +74 SLAB fastpath: 87 cycles(tsc) 21.814 sz - fallback - kmem_cache_free_bulk - kfree_bulk 1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns 2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns 3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns 4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns 8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns 16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns 30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns 32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns 34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns 48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns 64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns 128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns 158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns 250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns SLAB when enabling MEMCG_KMEM runtime: - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0) 1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns 2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns 3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns 4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns 8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns 16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns 30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns 32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns 34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns 48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns 64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 
5.259 ns 128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns 158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns 250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns Code size increase for SLUB: function old new delta kmem_cache_free_bulk 717 799 +82 SLUB benchmark: SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0) sz - fallback - kmem_cache_free_bulk - kfree_bulk 1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns 2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns 3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns 4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns 8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns 16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns 30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns 32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns 34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns 48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns 64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns 128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns 158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns 250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns SLUB when enabling MEMCG_KMEM runtime: - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0) 1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns 2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns 3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns 4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns 8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns 16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns 30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns 32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns 34 - 121 cycles 
30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns 48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns 64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns 128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns 158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns 250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
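The design described in this commit, where kfree_bulk() is handled by calling the bulk free path with a NULL cache pointer and recovering each object's cache from the object itself, can be sketched in userspace as follows. Everything here is a hypothetical model: the `model_*` names are invented, and a small header in front of each object stands in for the kernel's page->slab_cache lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache; only counts how many objects were returned to it. */
struct model_cache {
	int nr_freed;
};

/* Per-object header standing in for the page->slab_cache lookup. */
struct model_hdr {
	struct model_cache *cache;
};

/* Allocate one object and remember which cache it belongs to. */
static void *model_alloc(struct model_cache *c, size_t size)
{
	struct model_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->cache = c;
	return h + 1; /* payload starts right after the header */
}

/*
 * Bulk free. If s is NULL, the owning cache is looked up per object,
 * which is the extra per-object step the benchmarks above measure.
 */
static void model_cache_free_bulk(struct model_cache *s, size_t nr, void **p)
{
	for (size_t i = 0; i < nr; i++) {
		struct model_hdr *h;
		struct model_cache *c;

		if (!p[i])
			continue; /* tolerate NULL entries in this model */
		h = (struct model_hdr *)p[i] - 1;
		c = s ? s : h->cache;
		c->nr_freed++;
		free(h);
		p[i] = NULL;
	}
}

/* kfree_bulk() in this model is the NULL-cache case of the call above. */
static void model_kfree_bulk(size_t nr, void **p)
{
	model_cache_free_bulk(NULL, nr, p);
}
```

Passing an array that mixes objects from two model caches to `model_kfree_bulk()` returns each object to its own cache, without the caller naming either cache.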
|
#
11c7aec2 |
|
15-Mar-2016 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mm/slab: move SLUB alloc hooks to common mm/slab.h First step towards sharing alloc_hook's between SLUB and SLAB allocators. Move the SLUB allocators *_alloc_hook to the common mm/slab.h for internal slab definitions. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
376bf125 |
|
15-Mar-2016 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: clean up code for kmem cgroup support to kmem_cache_free_bulk This change is primarily an attempt to make it easier to realize the optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not enabled. Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the overhead is zero. This is because, as long as no process have enabled kmem cgroups accounting, the assignment is replaced by asm-NOP operations. This is possible because memcg_kmem_enabled() uses a static_key_false() construct. It also helps readability as it avoid accessing the p[] array like: p[size - 1] which "expose" that the array is processed backwards inside helper function build_detached_freelist(). Lastly this also makes the code more robust, in error case like passing NULL pointers in the array. Which were previously handled before commit 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk"). Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52b4b950 |
|
17-Feb-2016 |
Dmitry Safonov <0x7f454c46@gmail.com> |
mm: slab: free kmem_cache_node after destroy sysfs file When slub_debug alloc_calls_show is enabled we will try to track location and user of slab object on each online node, kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed till there is the last reference to sysfs file. This fixes the following panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: list_locations+0x169/0x4e0 PGD 257304067 PUD 438456067 PMD 0 Oops: 0000 [#1] SMP CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30 Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011 task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000 RIP: list_locations+0x169/0x4e0 Call Trace: alloc_calls_show+0x1d/0x30 slab_attr_show+0x1b/0x30 sysfs_read_file+0x9a/0x1a0 vfs_read+0x9c/0x170 SyS_read+0x58/0xb0 system_call_fastpath+0x16/0x1b Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10 CR2: 0000000000000020 Separated __kmem_cache_release from __kmem_cache_shutdown which now called on slab_kmem_cache_release (after the last reference to sysfs file object has dropped). Reintroduced locking in free_partial as sysfs file might access cache's partial list after shutdowning - partial revert of the commit 69cb8e6b7c29 ("slub: free slabs without holding locks"). 
Zap __remove_partial and use remove_partial (w/o underscores) as free_partial now takes list_lock, which is a partial revert of commit 1e4dd9461fab ("slub: do not assert not having lock in removing freed partial") Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
127424c8 |
|
20-Jan-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: memcontrol: move kmem accounting code to CONFIG_MEMCG The cgroup2 memory controller will account important in-kernel memory consumers per default. Move all necessary components to CONFIG_MEMCG. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
48c935ad |
|
15-Jan-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
page-flags: define PG_locked behavior on compound pages lock_page() must operate on the whole compound page. It doesn't make much sense to lock part of compound page. Change code to use head page's PG_locked, if tail page is passed. This patch also gets rid of custom helper functions -- __set_page_locked() and __clear_page_locked(). They are replaced with helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these helper would trigger VM_BUG_ON(). SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never appear there. VM_BUG_ON() is added to make sure that this assumption is correct. [akpm@linux-foundation.org: fix fs/cifs/file.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
230e9fc2 |
|
14-Jan-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slab: add SLAB_ACCOUNT flag Currently, if we want to account all objects of a particular kmem cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed to kmem_cache_create will force accounting for every allocation from this cache even if __GFP_ACCOUNT is not passed. This patch does not make any of the existing caches use this flag - it will be done later in the series. Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and hence cannot have different sets of SLAB_* flags. Thus using this flag will probably reduce the number of merged slabs even if kmem accounting is not used (only compiled in). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
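The two rules stated in this commit message (a cache-level flag forces accounting regardless of the per-call flag, and caches that differ in that flag cannot be merged) can be shown with a small userspace model. The flag values and `model_*` names below are hypothetical, and the merge check is deliberately reduced to the accounting flag alone; the kernel's actual merge criteria are not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model only. */
#define MODEL_GFP_ACCOUNT  0x1u /* per-allocation request for accounting */
#define MODEL_SLAB_ACCOUNT 0x1u /* per-cache flag set at creation time */
#define MODEL_SLAB_POISON  0x2u /* unrelated cache flag, for contrast */

struct model_cache {
	unsigned int flags;
};

/*
 * An allocation is accounted if the caller asked for it or if the
 * cache was created with the accounting flag.
 */
static bool model_should_account(const struct model_cache *c, unsigned int gfp)
{
	assert(c != NULL);
	return (gfp & MODEL_GFP_ACCOUNT) || (c->flags & MODEL_SLAB_ACCOUNT);
}

/*
 * Merged caches share one structure, so they cannot disagree on the
 * accounting flag. Simplification: only that flag is compared here.
 */
static bool model_can_merge(const struct model_cache *a,
			    const struct model_cache *b)
{
	assert(a != NULL && b != NULL);
	return (a->flags & MODEL_SLAB_ACCOUNT) ==
	       (b->flags & MODEL_SLAB_ACCOUNT);
}
```

In this model a cache created with `MODEL_SLAB_ACCOUNT` accounts every allocation even when the caller passes no accounting flag, and it refuses to merge with an otherwise identical cache that lacks the flag.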
|
#
865762a8 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slab/slub: adjust kmem_cache_alloc_bulk API Adjust kmem_cache_alloc_bulk API before we have any real users. Adjust the API to return type 'int' instead of the previous type 'bool'. This is done to allow future extension of the bulk alloc API. A future extension could be to allow SLUB to stop at a page boundary, when specified by a flag, and then return the number of objects. The advantage of this approach is that it would make it easier to run bulk alloc without local IRQs disabled, by using cmpxchg to "steal" the entire c->freelist or page->freelist. To avoid overshooting we would stop processing at a slab-page boundary. Else we always end up returning some objects at the cost of another cmpxchg. To stay compatible with future users of this API linking against an older kernel when using the new flag, we need to return the number of allocated objects with this API change. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
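The idea of a count-returning bulk allocation can be sketched in userspace C. This is only an illustration, not the kernel code: `toy_cache`, `toy_alloc_bulk()` and `toy_free_bulk()` are hypothetical names, `malloc()` stands in for the slab allocator, the `budget` field exists only to simulate memory exhaustion, and the all-or-nothing behaviour on failure is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache (not a kernel structure). */
struct toy_cache {
	size_t object_size;
	size_t budget;	/* objects that may still be handed out */
};

/* Free nr objects from p[] and give their budget back. */
static void toy_free_bulk(struct toy_cache *c, size_t nr, void **p)
{
	assert(c && p);
	for (size_t i = 0; i < nr; i++) {
		free(p[i]);
		p[i] = NULL;
		c->budget++;
	}
}

/*
 * Returns the number of objects stored in p[]: nr on success, 0 on
 * failure. An int return leaves room for a later variant that returns
 * a partial count, which a bool could not express.
 */
static int toy_alloc_bulk(struct toy_cache *c, size_t nr, void **p)
{
	assert(c && p);
	for (size_t i = 0; i < nr; i++) {
		p[i] = c->budget ? malloc(c->object_size) : NULL;
		if (!p[i]) {
			/* Undo the objects obtained so far. */
			toy_free_bulk(c, i, p);
			return 0;
		}
		c->budget--;
	}
	return (int)nr;
}
```

A caller would compare the return value with the requested count instead of testing a boolean, so a later partial-allocation mode would not need another signature change.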
|
#
03374518 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: add missing kmem cgroup support to kmem_cache_free_bulk The initial implementation missed kmem cgroup support in the kmem_cache_free_bulk() call; add it. If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough to not add any asm code. Incoming bulk free objects can belong to different kmem cgroups, and the object free call can happen at a later point outside memcg context. Thus, we need to keep the orig kmem_cache, to correctly verify that a memcg object matches its "root_cache" (s->memcg_params.root_cache). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
03ec0ed5 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk The call slab_pre_alloc_hook() interacts with kmemcg and is not allowed to be called several times inside the bulk alloc for loop, due to the call to memcg_kmem_get_cache(). This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache. As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able to handle an array of objects. A subtle detail is that the loop iterator "i" in slab_post_alloc_hook() must have the same type (size_t) as the size argument. This helps the compiler more easily realize that it can remove the loop when all debug statements inside the loop evaluate to nothing. Note, this is only an issue because the kernel is compiled with GCC option: -fno-strict-overflow In slab_alloc_node() the compiler inlines and optimizes the invocation of slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and accessing the object directly. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Vladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0ecd894 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: optimize bulk slowpath free by detached freelist This change focuses on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk. The calls slab_free (fastpath) and __slab_free (slowpath) have been extended with support for bulk free, which amortizes the overhead of the (locked) cmpxchg_double. To use the new bulking feature, we build what I call a detached freelist. The detached freelist takes advantage of three properties: 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free. 2) many freelists can co-exist side-by-side in the same slab-page, each with a separate head pointer. 3) it is the visibility of the head pointer that needs synchronization. Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects, without any synchronization needed. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs. All objects in a SLUB freelist must belong to the same slab-page. Thus, constructing the detached freelist is about matching objects that belong to the same slab-page. The bulk free array is scanned in a progressive manner with a limited look-ahead facility. Kmem debug support is handled in the call of slab_free(). Notice kmem_cache_free_bulk no longer needs to disable IRQs. This only slowed down single free bulk by approx 3 cycles. Performance data: Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns To get stable and comparable numbers, the kernel has been booted with "slab_nomerge" (this also improves performance for larger bulk sizes). 
Performance data, compared against fallback bulking:

bulk - fallback bulk             - improvement with this patch
   1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns - improved 21.0%
   2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc)  7.506 ns - improved 45.5%
   3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc)  5.865 ns - improved 56.6%
   4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc)  5.048 ns - improved 61.5%
   8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc)  4.659 ns - improved 64.0%
  16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc)  4.495 ns - improved 65.3%
  30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc)  4.533 ns - improved 63.3%
  32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc)  4.707 ns - improved 64.0%
  34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc)  5.976 ns - improved 76.0%
  48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc)  5.329 ns - improved 74.7%
  64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc)  5.127 ns - improved 73.0%
 128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc)  6.833 ns - improved 70.0%
 158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc)  7.583 ns - improved 69.7%
 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc)  9.280 ns - improved 64.4%

Performance data, compared current in-kernel bulking:

bulk - curr in-kernel  - improvement with this patch
   1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
   2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
   3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
   4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
   8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
  16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
  30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
  32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
  34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
  48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
  64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

Performance with normal SLUB merging is significantly slower for larger bulking. This is believed to (primarily) be an effect of not having to share the per-CPU data-structures, as tuning per-CPU size can achieve similar performance.

bulk - slab_nomerge   - normal SLUB merge
   1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
   2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
   3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
   4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
   8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
  16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
  30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
  32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
  34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
  48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
  64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

Joint work with Alexander Duyck. [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return] Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
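The detached freelist construction described in this commit can be approximated in userspace C. This is a minimal sketch under stated assumptions, not the kernel function: every `toy_` name is hypothetical, a "slab-page" is modelled as a 4096-byte aligned address range, the link pointer is stored in the first word of each object, and the look-ahead limit of 3 is an illustrative choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed slab-page size */
#define TOY_LOOKAHEAD 3		/* assumed look-ahead limit */

struct detached_freelist {
	uintptr_t page;	/* page shared by all linked objects */
	void *freelist;	/* head: most recently linked object */
	void *tail;	/* first object linked, i.e. end of the list */
	int cnt;	/* number of linked objects */
};

static uintptr_t toy_page_of(const void *obj)
{
	return (uintptr_t)obj & ~(uintptr_t)(TOY_PAGE_SIZE - 1);
}

/* The caller owns obj, so this write needs no synchronization. */
static void toy_set_next(void *obj, void *next)
{
	*(void **)obj = next;
}

/*
 * Scan p[0..size) from the end and link every object that shares a page
 * with the last unprocessed one. Linked entries are set to NULL.
 * Returns how many leading entries of p[] still need processing
 * (0 means the array is fully consumed).
 */
static size_t toy_build_detached_freelist(void **p, size_t size,
					  struct detached_freelist *df)
{
	size_t first_skipped = 0;
	int lookahead = TOY_LOOKAHEAD;
	void *obj = NULL;

	assert(p && df);
	df->page = 0;
	df->freelist = NULL;
	df->tail = NULL;
	df->cnt = 0;

	/* Find the last entry that has not been processed yet. */
	while (size && !obj)
		obj = p[--size];
	if (!obj)
		return 0;

	/* Start a new list; the head lives only in the caller's df. */
	toy_set_next(obj, NULL);
	df->page = toy_page_of(obj);
	df->freelist = obj;
	df->tail = obj;
	df->cnt = 1;
	p[size] = NULL;

	while (size) {
		obj = p[--size];
		if (!obj)
			continue;	/* already processed */
		if (toy_page_of(obj) == df->page) {
			toy_set_next(obj, df->freelist);
			df->freelist = obj;
			df->cnt++;
			p[size] = NULL;
			continue;
		}
		/* Object from another page: remember it, limit the search. */
		if (!first_skipped)
			first_skipped = size + 1;
		if (!--lookahead)
			break;
	}
	return first_skipped;
}
```

A caller would invoke this in a loop, handing each resulting (head, tail, cnt) triple to the free path and passing the returned value as the new `size`, until it returns 0.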
|
#
81084651 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: support for bulk free with SLUB freelists Make it possible to free a freelist with several objects by adjusting the API of slab_free() and __slab_free() to take a head, a tail and an object counter (cnt). Tail being NULL indicates a single object free of the head object. This allows compiler inline constant propagation in slab_free() and slab_free_freelist_hook() to avoid adding any overhead in case of single object free. This allows a freelist with several objects (all within the same slab-page) to be freed using a single locked cmpxchg_double in __slab_free() and with an unlocked cmpxchg_double in slab_free(). Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slab-page. These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs. Micro benchmarking showed no performance reduction due to this change, when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
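The head/tail/cnt calling convention can be illustrated with C11 atomics in userspace. This is a sketch only: `toy_slab`, `toy_obj` and `toy_slab_free()` are hypothetical names, and where the kernel updates the freelist and its counters together with cmpxchg_double, this simplified version uses a plain compare-and-swap on the list head and a separate atomic counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object layout: the link lives inside the free object. */
struct toy_obj {
	struct toy_obj *next;
};

struct toy_slab {
	_Atomic(struct toy_obj *) freelist;
	atomic_int free_count;
};

static void toy_slab_init(struct toy_slab *slab)
{
	atomic_init(&slab->freelist, NULL);
	atomic_init(&slab->free_count, 0);
}

/*
 * Free a pre-linked chain head..tail holding cnt objects with a single
 * successful compare-and-swap. tail == NULL means "just the head object".
 */
static void toy_slab_free(struct toy_slab *slab, struct toy_obj *head,
			  struct toy_obj *tail, int cnt)
{
	struct toy_obj *last = tail ? tail : head;
	struct toy_obj *old;

	assert(slab && head && cnt > 0);
	old = atomic_load(&slab->freelist);
	do {
		/* Splice the existing list behind the chain being freed. */
		last->next = old;
	} while (!atomic_compare_exchange_weak(&slab->freelist, &old, head));
	atomic_fetch_add(&slab->free_count, cnt);
}
```

The cost of the atomic operation is paid once per chain rather than once per object, which is the amortization that bulk free relies on.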
|
#
b4a64718 |
|
20-Nov-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated #else. For readability mark it with a comment. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
87098373 |
|
20-Nov-2015 |
Christoph Lameter <cl@linux.com> |
slub: avoid irqoff/on in bulk allocation Use the new function that can do allocation while interrupts are disabled. Avoids irq on/off sequences. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a380a3c7 |
|
20-Nov-2015 |
Christoph Lameter <cl@linux.com> |
slub: create new ___slab_alloc function that can be called with irqs disabled Bulk alloc needs a function like that because it enables interrupts before calling __slab_alloc which promptly disables them again using the expensive local_irq_save(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bc4f610d |
|
06-Nov-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
slab, slub: use page->rcu_head instead of page->lru plus cast We have properly typed page->rcu_head, no need to cast page->lru. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0164adc |
|
06-Nov-2015 |
Mel Gorman <mgorman@techsingularity.net> |
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access to one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimistic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truly atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites:
o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT and were depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
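The split of the old single "may wait" bit into separate properties can be modelled with a few flag bits. The following is a userspace sketch, not the kernel header: all `TOY_` names and bit values are hypothetical, and the composition chosen for the atomic mask (high priority, atomic, wakes kswapd) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

/* Hypothetical bit values; only the relationships matter here. */
#define TOY_GFP_HIGH		0x01u	/* may use the high priority reserve */
#define TOY_GFP_ATOMIC		0x02u	/* truly atomic, no alternative */
#define TOY_GFP_DIRECT_RECLAIM	0x04u	/* may sleep in direct reclaim */
#define TOY_GFP_KSWAPD_RECLAIM	0x08u	/* wants kswapd woken */

/* "Willing to enter direct reclaim and wake kswapd". */
#define TOY_GFP_WAIT	(TOY_GFP_DIRECT_RECLAIM | TOY_GFP_KSWAPD_RECLAIM)
/* Assumed composition of an atomic-context mask. */
#define TOY_GFP_ATOMIC_CTX \
	(TOY_GFP_HIGH | TOY_GFP_ATOMIC | TOY_GFP_KSWAPD_RECLAIM)

/* Modelled on the helper named in the commit: may this caller block? */
static bool toy_allow_blocking(toy_gfp_t flags)
{
	return (flags & TOY_GFP_DIRECT_RECLAIM) != 0;
}

static bool toy_wakes_kswapd(toy_gfp_t flags)
{
	return (flags & TOY_GFP_KSWAPD_RECLAIM) != 0;
}

static bool toy_may_use_atomic_reserve(toy_gfp_t flags)
{
	/* In this model an atomic caller must not also allow blocking. */
	assert(!((flags & TOY_GFP_ATOMIC) && toy_allow_blocking(flags)));
	return (flags & TOY_GFP_ATOMIC) != 0;
}

/*
 * A mempool-backed caller: unwilling to sleep because it has a
 * fallback, but still wants background reclaim to make progress.
 */
static toy_gfp_t toy_mempool_flags(toy_gfp_t flags)
{
	return flags & ~TOY_GFP_DIRECT_RECLAIM;
}
```

In this model a mempool-backed request is non-blocking without gaining access to the atomic reserve, which is the distinction the old single flag could not express.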
|
#
89d3c87e |
|
05-Nov-2015 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm, slub, kasan: enable user tracking by default with KASAN=y It's recommended to have slub's user tracking enabled with CONFIG_KASAN, because: a) User tracking disables slab merging which improves detecting out-of-bounds accesses. b) User tracking metadata acts as redzone which also improves detecting out-of-bounds accesses. c) User tracking provides additional information about the object. This information helps to understand bugs. Currently it is not enabled by default. Besides recompiling the kernel with KASAN and reinstalling it, the user also has to change the boot cmdline, which is not very handy. Enable slub user tracking by default with KASAN=y, since there is no good reason to not do this. [akpm@linux-foundation.org: little fixes, per David] Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f3ccb2c4 |
|
05-Nov-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
memcg: unify slab and other kmem pages charging We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of the kmem cache while memcg_charge_kmem charges to the memcg that the current task belongs to. To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charges to the current context. Next, it makes the slab subsystem use this function to charge slab pages. Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup. Note, this patch switches slab to the charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference. [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f835703 |
|
05-Nov-2015 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
mm/slub: calculate start order with reserved in consideration In slub_order(), the order starts from max(min_order, get_order(min_objects * size)). When (min_objects * size) has different order from (min_objects * size + reserved), it will skip this order via a check in the loop. This patch optimizes this a little by calculating the start order with `reserved' in consideration and removing the check in loop. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
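The effect of folding `reserved` into the start order can be shown with a small userspace calculation. This is a sketch under assumptions: `toy_get_order()` and `toy_start_order()` are hypothetical helpers, and a 4096-byte page is assumed.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096ul	/* assumed page size */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
static unsigned int toy_get_order(unsigned long size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((TOY_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * First order worth trying: it must hold min_objects objects plus the
 * reserved bytes, and it must not be below min_order.
 */
static unsigned int toy_start_order(unsigned int min_order,
				    unsigned long min_objects,
				    unsigned long size,
				    unsigned long reserved)
{
	unsigned int order = toy_get_order(min_objects * size + reserved);

	return order > min_order ? order : min_order;
}
```

With 16 objects of 256 bytes the payload exactly fills one page, so any non-zero `reserved` pushes the start order from 0 to 1. Computing that up front is what makes a per-iteration size check in the loop unnecessary.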
|
#
033fd1bd |
|
05-Nov-2015 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
mm/slub: use get_order() instead of fls() get_order() is easier to understand than fls(). This patch just replaces it. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
422ff4d7 |
|
05-Nov-2015 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
mm/slub: correct the comment in calculate_order() In calculate_order(), it tries to calculate the best order by adjusting the fraction and min_objects. On each iteration on min_objects, fraction iterates on 16, 8, 4. Which means the acceptable waste increases with 1/16, 1/8, 1/4. This patch corrects the comment according to the code. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
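The iteration order described here (for each min_objects value, try fractions 16, 8, 4) can be sketched in userspace C. This is an illustration only: the `toy_` functions are hypothetical, a 4096-byte page is assumed, and details of the real code such as reserved bytes, a minimum order and the single-object fallback are left out.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096ul	/* assumed page size */

/*
 * Lowest order up to max_order that holds min_objects objects and
 * wastes at most 1/fraction of the slab, or -1 if there is none.
 */
static int toy_slab_order(unsigned long size, unsigned long min_objects,
			  int max_order, unsigned long fraction)
{
	assert(size > 0 && fraction > 0);
	for (int order = 0; order <= max_order; order++) {
		unsigned long slab_size = TOY_PAGE_SIZE << order;

		if (slab_size < min_objects * size)
			continue;
		if (slab_size % size <= slab_size / fraction)
			return order;
	}
	return -1;
}

/*
 * For each min_objects value, relax the acceptable waste from 1/16 to
 * 1/8 to 1/4 before giving up one object and starting over.
 */
static int toy_calculate_order(unsigned long size, unsigned long min_objects,
			       int max_order)
{
	for (; min_objects > 1; min_objects--) {
		for (unsigned long fraction = 16; fraction >= 4; fraction /= 2) {
			int order = toy_slab_order(size, min_objects,
						   max_order, fraction);
			if (order >= 0)
				return order;
		}
	}
	return -1;
}
```

For 700-byte objects a single page wastes 596 bytes, which exceeds 1/16 and 1/8 of the page but is accepted at 1/4, so the allowed waste grows before the object count requirement is lowered.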
|
#
96db800f |
|
08-Sep-2015 |
Vlastimil Babka <vbabka@suse.cz> |
mm: rename alloc_pages_exact_node() to __alloc_pages_node() alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fall back to the current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has led to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has also been considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. 
Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
45eb00cd |
|
04-Sep-2015 |
Joonsoo Kim <js1304@gmail.com> |
mm/slub: don't wait for high-order page allocation Description is almost copied from commit fb05e7a89f50 ("net: don't wait for order-3 page allocation"). I saw excessive direct memory reclaim/compaction triggered by slub. This causes performance issues and adds latency. Slub uses high-order allocation to reduce internal fragmentation and management overhead. But direct memory reclaim/compaction has high overhead and the benefit of high-order allocation can't compensate for the overhead of both. This patch makes auxiliary high-order allocation atomic. If there is no memory pressure and memory isn't fragmented, the allocation will still succeed, so we don't sacrifice high-order allocation's benefit here. If the atomic allocation fails, direct memory reclaim/compaction will not be triggered and the allocation falls back to low-order immediately, hence the direct memory reclaim/compaction overhead is avoided. In the allocation failure case, kswapd is woken up and tries to make high-order free pages, so the allocation could succeed next time. Following is the test to measure the effect of this patch. System: QEMU, CPU 8, 512 MB Mem: 25% memory is allocated at random position to make fragmentation. Memory-hogger occupies 150 MB memory. Workload: hackbench -g 20 -l 1000 Average result by 10 runs (Base vs Patched) elapsed_time(s): 4.3468 vs 2.9838 compact_stall: 461.7 vs 73.6 pgmigrate_success: 28315.9 vs 7256.1 Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
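The try-cheaply-then-fall-back strategy can be modelled with a toy page allocator in userspace C. This is a sketch with assumed behaviour: all `toy_`/`TOY_` names are hypothetical, a single flag bit stands in for "may enter direct reclaim", and reclaim is assumed to always succeed once it is allowed to run.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAY_RECLAIM 0x1u	/* hypothetical "may enter direct reclaim" bit */

/* Toy page allocator state, used only to observe the behaviour. */
struct toy_allocator {
	int free_order;		/* largest order available without reclaim */
	int reclaim_calls;	/* how often the expensive path ran */
};

static bool toy_alloc_pages(struct toy_allocator *a, int order,
			    unsigned int flags)
{
	if (order <= a->free_order)
		return true;
	if (!(flags & TOY_MAY_RECLAIM))
		return false;
	/* Direct reclaim/compaction: assumed slow, assumed to succeed. */
	a->reclaim_calls++;
	return true;
}

/*
 * Try the preferred (high) order without reclaim first; on failure
 * fall back to the minimum order with the caller's original flags.
 * Returns the order obtained, or -1.
 */
static int toy_allocate_slab(struct toy_allocator *a, int preferred,
			     int min, unsigned int flags)
{
	assert(a && preferred >= min);
	if (preferred > min &&
	    toy_alloc_pages(a, preferred, flags & ~TOY_MAY_RECLAIM))
		return preferred;
	return toy_alloc_pages(a, min, flags) ? min : -1;
}
```

When only order-0 pages are free, the order-3 attempt fails immediately and the order-0 fallback succeeds without the reclaim path ever running for the high-order request.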
|
#
80da026a |
|
04-Sep-2015 |
Konstantin Khlebnikov <koct9i@gmail.com> |
mm/slub: fix slab double-free in case of duplicate sysfs filename sysfs_slab_add() shouldn't call kobject_put on the error path: this puts the last reference of the kmem-cache kobject and frees it. The kmem cache will be freed a second time on the error path in kmem_cache_create(). For example this happens when slub debug was enabled at runtime and somebody creates a new kmem cache: # echo 1 | tee /sys/kernel/slab/*/sanity_checks # modprobe configfs "configfs_dir_cache" cannot be merged because the existing slabs have debug enabled, and a new slab cannot be created because the unique name ":t-0000096" is already taken. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
588f8ba9 |
|
04-Sep-2015 |
Thomas Gleixner <tglx@linutronix.de> |
mm/slub: move slab initialization into irq enabled region Initializing a new slab can introduce rather large latencies because most of the initialization runs always with interrupts disabled. There is no point in doing so. The newly allocated slab is not visible yet, so there is no reason to protect it against concurrent alloc/free. Move the expensive parts of the initialization into allocate_slab(), so for all allocations with GFP_WAIT set, interrupts are enabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3eed034d |
|
04-Sep-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: add support for kmem_cache_debug in bulk calls Per request of Joonsoo Kim, add kmem debug support. I've tested that when debugging is disabled, there is almost no performance impact as this code basically gets removed by the compiler. Need some guidance in enabling and testing this.

bulk - PREVIOUS                 - THIS-PATCH
   1 - 43 cycles(tsc) 10.811 ns - 44 cycles(tsc) 11.236 ns improved -2.3%
   2 - 27 cycles(tsc)  6.867 ns - 28 cycles(tsc)  7.019 ns improved -3.7%
   3 - 21 cycles(tsc)  5.496 ns - 22 cycles(tsc)  5.526 ns improved -4.8%
   4 - 24 cycles(tsc)  6.038 ns - 19 cycles(tsc)  4.786 ns improved 20.8%
   8 - 17 cycles(tsc)  4.280 ns - 18 cycles(tsc)  4.572 ns improved -5.9%
  16 - 17 cycles(tsc)  4.483 ns - 18 cycles(tsc)  4.658 ns improved -5.9%
  30 - 18 cycles(tsc)  4.531 ns - 18 cycles(tsc)  4.568 ns improved 0.0%
  32 - 58 cycles(tsc) 14.586 ns - 65 cycles(tsc) 16.454 ns improved -12.1%
  34 - 53 cycles(tsc) 13.391 ns - 63 cycles(tsc) 15.932 ns improved -18.9%
  48 - 65 cycles(tsc) 16.268 ns - 50 cycles(tsc) 12.506 ns improved 23.1%
  64 - 53 cycles(tsc) 13.440 ns - 63 cycles(tsc) 15.929 ns improved -18.9%
 128 - 79 cycles(tsc) 19.899 ns - 86 cycles(tsc) 21.583 ns improved -8.9%
 158 - 90 cycles(tsc) 22.732 ns - 90 cycles(tsc) 22.552 ns improved 0.0%
 250 - 95 cycles(tsc) 23.916 ns - 98 cycles(tsc) 24.589 ns improved -3.2%

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fbd02630 |
|
04-Sep-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: initial bulk free implementation This implements SLUB specific kmem_cache_free_bulk(). The SLUB allocator now has both bulk alloc and free implemented. Choose to reenable local IRQs while calling slowpath __slab_free(). In the worst case, where all objects hit the slowpath call, the performance should still be faster than the fallback function __kmem_cache_free_bulk(), because local_irq_{disable+enable} is very fast (7-cycles), while the fallback invokes this_cpu_cmpxchg() which is slightly slower (9-cycles). Nitpicking, this should be faster for N>=4, due to the entry cost of local_irq_{disable+enable}. Do notice that the save+restore variant is very expensive; this is key to why this optimization works.

CPU: i7-4790K CPU @ 4.00GHz
 * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
 * local_irq_{save,restore} : 37 cycles(tsc) - 9.443 ns

Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns

Bulk - fallback                  - this-patch
   1 -  58 cycles(tsc) 14.542 ns - 43 cycles(tsc) 10.811 ns improved 25.9%
   2 -  50 cycles(tsc) 12.659 ns - 27 cycles(tsc)  6.867 ns improved 46.0%
   3 -  48 cycles(tsc) 12.168 ns - 21 cycles(tsc)  5.496 ns improved 56.2%
   4 -  47 cycles(tsc) 11.987 ns - 24 cycles(tsc)  6.038 ns improved 48.9%
   8 -  46 cycles(tsc) 11.518 ns - 17 cycles(tsc)  4.280 ns improved 63.0%
  16 -  45 cycles(tsc) 11.366 ns - 17 cycles(tsc)  4.483 ns improved 62.2%
  30 -  45 cycles(tsc) 11.433 ns - 18 cycles(tsc)  4.531 ns improved 60.0%
  32 -  75 cycles(tsc) 18.983 ns - 58 cycles(tsc) 14.586 ns improved 22.7%
  34 -  71 cycles(tsc) 17.940 ns - 53 cycles(tsc) 13.391 ns improved 25.4%
  48 -  80 cycles(tsc) 20.077 ns - 65 cycles(tsc) 16.268 ns improved 18.8%
  64 -  71 cycles(tsc) 17.799 ns - 53 cycles(tsc) 13.440 ns improved 25.4%
 128 -  91 cycles(tsc) 22.980 ns - 79 cycles(tsc) 19.899 ns improved 13.2%
 158 - 100 cycles(tsc) 25.241 ns - 90 cycles(tsc) 22.732 ns improved 10.0%
 250 - 102 cycles(tsc) 25.583 ns - 95 cycles(tsc) 23.916 ns improved 6.9%

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ebe909e0 |
|
04-Sep-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: improve bulk alloc strategy Call slowpath __slab_alloc() from within the bulk loop, as the side-effect of this call likely repopulates c->freelist. Choose to reenable local IRQs while calling slowpath. Saving some optimizations for later. E.g. it is possible to extract parts of __slab_alloc() and avoid the unnecessary and expensive (37 cycles) local_irq_{save,restore}. For now, be happy calling __slab_alloc(): this lowers the icache impact of this function, and I don't have to worry about correctness.
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
Bulk - fallback - this-patch
1 - 58 cycles(tsc) 14.516 ns - 49 cycles(tsc) 12.459 ns improved 15.5%
2 - 51 cycles(tsc) 12.930 ns - 38 cycles(tsc) 9.605 ns improved 25.5%
3 - 49 cycles(tsc) 12.274 ns - 34 cycles(tsc) 8.525 ns improved 30.6%
4 - 48 cycles(tsc) 12.058 ns - 32 cycles(tsc) 8.036 ns improved 33.3%
8 - 46 cycles(tsc) 11.609 ns - 31 cycles(tsc) 7.756 ns improved 32.6%
16 - 45 cycles(tsc) 11.451 ns - 32 cycles(tsc) 8.148 ns improved 28.9%
30 - 79 cycles(tsc) 19.865 ns - 68 cycles(tsc) 17.164 ns improved 13.9%
32 - 76 cycles(tsc) 19.212 ns - 66 cycles(tsc) 16.584 ns improved 13.2%
34 - 74 cycles(tsc) 18.600 ns - 63 cycles(tsc) 15.954 ns improved 14.9%
48 - 88 cycles(tsc) 22.092 ns - 77 cycles(tsc) 19.373 ns improved 12.5%
64 - 80 cycles(tsc) 20.043 ns - 68 cycles(tsc) 17.188 ns improved 15.0%
128 - 99 cycles(tsc) 24.818 ns - 89 cycles(tsc) 22.404 ns improved 10.1%
158 - 99 cycles(tsc) 24.977 ns - 92 cycles(tsc) 23.089 ns improved 7.1%
250 - 106 cycles(tsc) 26.552 ns - 99 cycles(tsc) 24.785 ns improved 6.6%
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
994eb764 |
|
04-Sep-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub bulk alloc: extract objects from the per cpu slab First piece: acceleration of retrieval of per cpu objects. If we are allocating lots of objects then it is advantageous to disable interrupts and avoid the this_cpu_cmpxchg() operation to get these objects faster. Note that we cannot do the fast operation if debugging is enabled, because we would have to add extra code to do all the debugging checks. And it would not be fast anyway. Note also that the requirement of having interrupts disabled avoids having to do processor flag operations. Allocate as many objects as possible in the fast way and then fall back to the generic implementation for the rest of the objects.
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns
Bulk - fallback - this-patch
1 - 57 cycles(tsc) 14.432 ns - 48 cycles(tsc) 12.155 ns improved 15.8%
2 - 50 cycles(tsc) 12.746 ns - 37 cycles(tsc) 9.390 ns improved 26.0%
3 - 48 cycles(tsc) 12.180 ns - 33 cycles(tsc) 8.417 ns improved 31.2%
4 - 48 cycles(tsc) 12.015 ns - 32 cycles(tsc) 8.045 ns improved 33.3%
8 - 46 cycles(tsc) 11.526 ns - 30 cycles(tsc) 7.699 ns improved 34.8%
16 - 45 cycles(tsc) 11.418 ns - 32 cycles(tsc) 8.205 ns improved 28.9%
30 - 80 cycles(tsc) 20.246 ns - 73 cycles(tsc) 18.328 ns improved 8.8%
32 - 79 cycles(tsc) 19.946 ns - 72 cycles(tsc) 18.208 ns improved 8.9%
34 - 78 cycles(tsc) 19.659 ns - 71 cycles(tsc) 17.987 ns improved 9.0%
48 - 86 cycles(tsc) 21.516 ns - 82 cycles(tsc) 20.566 ns improved 4.7%
64 - 93 cycles(tsc) 23.423 ns - 89 cycles(tsc) 22.480 ns improved 4.3%
128 - 100 cycles(tsc) 25.170 ns - 99 cycles(tsc) 24.871 ns improved 1.0%
158 - 102 cycles(tsc) 25.549 ns - 101 cycles(tsc) 25.375 ns improved 1.0%
250 - 101 cycles(tsc) 25.344 ns - 100 cycles(tsc) 25.182 ns improved 1.0%
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
484748f0 |
|
04-Sep-2015 |
Christoph Lameter <cl@linux.com> |
slab: infrastructure for bulk object allocation and freeing Add the basic infrastructure for alloc/free operations on pointer arrays. It includes a generic function in the common slab code that is used in this infrastructure patch to create the unoptimized functionality for slab bulk operations. Allocators can then provide optimized allocation functions for situations in which large numbers of objects are needed. These optimization may avoid taking locks repeatedly and bypass metadata creation if all objects in slab pages can be used to provide the objects required. Allocators can extend the skeletons provided and add their own code to the bulk alloc and free functions. They can keep the generic allocation and freeing and just fall back to those if optimizations would not work (like for example when debugging is on). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
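The unoptimized generic path described above can be pictured as a plain loop over the pointer array. The following is a minimal userspace sketch, not the kernel code: `toy_cache`, `toy_alloc_bulk()` and `toy_free_bulk()` are hypothetical names, and `malloc()`/`free()` stand in for the single-object slab calls. It assumes the fallback releases already-allocated objects when one allocation fails, so the caller never sees a half-filled array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache: only the object size matters here. */
struct toy_cache {
	size_t object_size;
};

/* Generic bulk free: hand every object in the array back, one by one. */
static void toy_free_bulk(struct toy_cache *s, size_t nr, void **p)
{
	assert(s && p);
	for (size_t i = 0; i < nr; i++) {
		free(p[i]);
		p[i] = NULL;
	}
}

/*
 * Generic bulk alloc: fill p[0..nr-1] one object at a time.
 * On failure, release what was already allocated and report false.
 */
static bool toy_alloc_bulk(struct toy_cache *s, size_t nr, void **p)
{
	assert(s && p);
	for (size_t i = 0; i < nr; i++) {
		p[i] = malloc(s->object_size);
		if (!p[i]) {
			toy_free_bulk(s, i, p);
			return false;
		}
	}
	return true;
}
```

An allocator-specific version would keep the same signatures and only replace the loop body with a faster path, falling back to this loop when the fast path cannot be used (for example when debugging is on).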
|
#
2ae44005 |
|
04-Sep-2015 |
Jesper Dangaard Brouer <brouer@redhat.com> |
slub: fix spelling succedd to succeed With this patchset the SLUB allocator now has both bulk alloc and free implemented. This patchset mostly optimizes the "fastpath" where objects are available on the per CPU fastpath page. This mostly amortizes the less-heavy non-locked cmpxchg_double used on the fastpath. The "fallback" bulking (e.g. __kmem_cache_free_bulk) provides a good basis for comparison. For the measurements[1], the fallback functions __kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and forced "noinline" to force a function call, as in slab_common.c.
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
Measurements last-patch with disabled debugging:
Bulk - fallback - this-patch
1 - 57 cycles(tsc) 14.448 ns - 44 cycles(tsc) 11.236 ns improved 22.8%
2 - 51 cycles(tsc) 12.768 ns - 28 cycles(tsc) 7.019 ns improved 45.1%
3 - 48 cycles(tsc) 12.232 ns - 22 cycles(tsc) 5.526 ns improved 54.2%
4 - 48 cycles(tsc) 12.025 ns - 19 cycles(tsc) 4.786 ns improved 60.4%
8 - 46 cycles(tsc) 11.558 ns - 18 cycles(tsc) 4.572 ns improved 60.9%
16 - 45 cycles(tsc) 11.458 ns - 18 cycles(tsc) 4.658 ns improved 60.0%
30 - 45 cycles(tsc) 11.499 ns - 18 cycles(tsc) 4.568 ns improved 60.0%
32 - 79 cycles(tsc) 19.917 ns - 65 cycles(tsc) 16.454 ns improved 17.7%
34 - 78 cycles(tsc) 19.655 ns - 63 cycles(tsc) 15.932 ns improved 19.2%
48 - 68 cycles(tsc) 17.049 ns - 50 cycles(tsc) 12.506 ns improved 26.5%
64 - 80 cycles(tsc) 20.009 ns - 63 cycles(tsc) 15.929 ns improved 21.3%
128 - 94 cycles(tsc) 23.749 ns - 86 cycles(tsc) 21.583 ns improved 8.5%
158 - 97 cycles(tsc) 24.299 ns - 90 cycles(tsc) 22.552 ns improved 7.2%
250 - 102 cycles(tsc) 25.681 ns - 98 cycles(tsc) 24.589 ns improved 3.9%
Benchmarking shows impressive improvements in the "fastpath" with a small number of objects in the working set. Once the working set increases, resulting in activating the "slowpath" (that contains the heavier locked cmpxchg_double), the improvement decreases. I'm currently working on also optimizing the "slowpath" (as the network stack use-case hits this), but this patchset should provide a good foundation for further improvements. The rest of my patch queue in this area needs some more work, but preliminary results are good. I'm attending Netfilter Workshop[2] next week, and I'll hopefully return to working on further improvements in this area. This patch (of 6): s/succedd/succeed/ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2f064f34 |
|
21-Aug-2015 |
Michal Hocko <mhocko@suse.com> |
mm: make page pfmemalloc check more robust Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc(): if (page->pfmemalloc && !page->mapping) skb->pfmemalloc = true; It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set page->mapping to NULL and leave the page->index value alone. Because the two fields share a union, a non-zero page->index will be interpreted as a true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with an NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case, so the page confuses __skb_fill_page_desc(), which interprets the index as the pfmemalloc flag. The network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping. That is what happens here, and it leads to hangs when the NFS client waits for a response from the server which has been dropped and thus never arrives. The struct page is already heavily packed, so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index, so it should never hold a value that large. Replace all direct users of page->pfmemalloc by page_is_pfmemalloc, which will hide this nastiness from unspoiled eyes. The information will obviously get lost if somebody wants to use page->index, but that was the case before, and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do).
[akpm@linux-foundation.org: fix blooper in slub] Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.com> Debugged-by: Jiri Bohac <jbohac@suse.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
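The index trick described above can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct toy_page` is a hypothetical two-field stand-in for `struct page`, and the `toy_` helpers only mirror the idea behind page_is_pfmemalloc(); they are not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page with only the fields relevant to the trick (not the kernel layout). */
struct toy_page {
	unsigned long index;	/* page cache offset, or the marker below */
	void *mapping;
};

/* A real page index should never be this large, so it can double as a flag. */
#define TOY_PFMEMALLOC_MARK (-1UL)

/* Only the exact marker value counts; any ordinary index reads as "false". */
static bool toy_page_is_pfmemalloc(const struct toy_page *page)
{
	return page->index == TOY_PFMEMALLOC_MARK;
}

static void toy_set_page_pfmemalloc(struct toy_page *page)
{
	page->index = TOY_PFMEMALLOC_MARK;
}

static void toy_clear_page_pfmemalloc(struct toy_page *page)
{
	page->index = 0;
}
```

With the old boolean-in-a-union scheme, a page left with `index = 42` and `mapping = NULL` would have looked like a pfmemalloc page. Comparing against a single impossible value avoids that misreading.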
|
#
34cc6990 |
|
24-Jun-2015 |
Daniel Sanders <daniel.sanders@imgtec.com> |
slab: correct size_index table before replacing the bootstrap kmem_cache_node This patch moves the initialization of the size_index table slightly earlier so that the first few kmem_cache_node's can be safely allocated when KMALLOC_MIN_SIZE is large. There are currently two ways to generate indices into kmalloc_caches (via kmalloc_index() and via the size_index table in slab_common.c) and on some arches (possibly only MIPS) they potentially disagree with each other until create_kmalloc_caches() has been called. It seems that the intention is that the size_index table is a fast equivalent to kmalloc_index() and that create_kmalloc_caches() patches the table to return the correct value for the cases where kmalloc_index()'s if-statements apply.
The failing sequence was:
* kmalloc_caches contains NULL elements
* kmem_cache_init initialises the element that 'struct kmem_cache_node' will be allocated to. For 32-bit MIPS, this is a 56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
* init_list is called which calls kmalloc_node to allocate a 'struct kmem_cache_node'.
* kmalloc_slab selects the kmalloc_caches element using size_index[size_index_elem(size)]. For MIPS, size is 56, and the expression returns 6.
* This element of kmalloc_caches is NULL and allocation fails.
* If it had not already failed, it would have called create_kmalloc_caches() at this point which would have changed size_index[size_index_elem(size)] to 7.
I don't believe the bug to be LLVM specific but GCC doesn't normally encounter the problem. I haven't been able to identify exactly what GCC is doing better (probably inlining) but it seems that GCC is managing to optimize to the point that it eliminates the problematic allocations. This theory is supported by the fact that GCC can be made to fail in the same way by changing inline, __inline, __inline__, and __always_inline in include/linux/compiler-gcc.h such that they don't actually inline things.
Signed-off-by: Daniel Sanders <daniel.sanders@imgtec.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
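The disagreement between the two lookup paths can be reproduced with a small table model. This is a hedged sketch: the `toy_` names are hypothetical, the table values are illustrative and only modeled on the layout in slab_common.c, and the patch loop is a simplified version of what the message says create_kmalloc_caches() does. Valid input sizes are 1 to 192 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TABLE_ENTRIES 24	/* covers sizes 8..192 in 8-byte steps */

/* Default table: entry i holds the cache index for sizes up to 8 * (i + 1). */
static unsigned char toy_size_index[TOY_TABLE_ENTRIES] = {
	3, 4, 5, 5, 6, 6, 6, 6,	/*   8 ..  64 bytes */
	1, 1, 1, 1, 7, 7, 7, 7,	/*  72 .. 128 bytes */
	2, 2, 2, 2, 2, 2, 2, 2,	/* 136 .. 192 bytes */
};

/* Map a size in bytes (1..192) to its slot in the table. */
static size_t toy_size_index_elem(size_t bytes)
{
	return (bytes - 1) / 8;
}

/* Table-based lookup, the fast path that went wrong before patching. */
static unsigned toy_lookup(size_t bytes)
{
	assert(bytes >= 1 && bytes <= 8 * TOY_TABLE_ENTRIES);
	return toy_size_index[toy_size_index_elem(bytes)];
}

/* Redirect every size below min_size to the smallest cache that exists. */
static void toy_patch_table(size_t min_size, unsigned char shift_low)
{
	for (size_t i = 8; i < min_size && i <= 8 * TOY_TABLE_ENTRIES; i += 8)
		toy_size_index[toy_size_index_elem(i)] = shift_low;
}
```

Before patching, a 56-byte request maps to index 6, which has no cache when the minimum size is 128 bytes. After patching with a minimum shift of 7, the same request maps to index 7, which is why the patch has to run before the first such allocation.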
|
#
4db0c3c2 |
|
15-Apr-2015 |
Jason Low <jason.low2@hp.com> |
mm: remove rest of ACCESS_ONCE() usages We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and uses the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6f6528a1 |
|
14-Apr-2015 |
Joe Perches <joe@perches.com> |
slub: use bool function return values of true/false not 1/0 Use the normal return values for bool functions Signed-off-by: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
08303a73 |
|
14-Apr-2015 |
Chris J Arges <chris.j.arges@canonical.com> |
mm/slub.c: parse slub_debug O option in switch statement By moving the O option detection into the switch statement, we allow this parameter to be combined with other options correctly. Previously options like slub_debug=OFZ would only detect the 'o' and use DEBUG_DEFAULT_FLAGS to fill in the rest of the flags. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
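Handling every option letter in one switch is what lets 'O' combine with the other flags. The sketch below is a userspace illustration with hypothetical names (`toy_parse_debug()`, the `TOY_` flag bits, `struct toy_debug_opts`); it covers only a few letters and is not the kernel's parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical flag bits, one per option letter handled in this sketch. */
enum {
	TOY_DEBUG_FREE = 1 << 0,	/* 'F' */
	TOY_RED_ZONE   = 1 << 1,	/* 'Z' */
	TOY_POISON     = 1 << 2,	/* 'P' */
	TOY_STORE_USER = 1 << 3,	/* 'U' */
};

struct toy_debug_opts {
	unsigned flags;
	bool disable_higher_order_debug;	/* set by 'O' */
	bool ok;				/* false if an unknown letter was seen */
};

/* Parse option letters up to the end of the string or the first comma. */
static struct toy_debug_opts toy_parse_debug(const char *str)
{
	struct toy_debug_opts o = { 0, false, true };

	for (; *str && *str != ','; str++) {
		switch (tolower((unsigned char)*str)) {
		case 'f':
			o.flags |= TOY_DEBUG_FREE;
			break;
		case 'z':
			o.flags |= TOY_RED_ZONE;
			break;
		case 'p':
			o.flags |= TOY_POISON;
			break;
		case 'u':
			o.flags |= TOY_STORE_USER;
			break;
		case 'o':
			/* Just another case: the remaining letters still get parsed. */
			o.disable_higher_order_debug = true;
			break;
		default:
			o.ok = false;
		}
	}
	return o;
}
```

Parsing "OFZ" with this sketch sets the 'O' behaviour and both the F and Z bits, instead of stopping at the first letter.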
|
#
859b7a0e |
|
25-Mar-2015 |
Mark Rutland <mark.rutland@arm.com> |
mm/slub: fix lockups on PREEMPT && !SMP kernels Commit 9aabf810a67c ("mm/slub: optimize alloc/free fastpath by removing preemption on/off") introduced an occasional hang for kernels built with CONFIG_PREEMPT && !CONFIG_SMP. The problem is the following loop the patch introduced to slab_alloc_node and slab_free:
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = raw_cpu_ptr(s->cpu_slab);
	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
GCC 4.9 has been observed to hoist the load of c and c->tid above the loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time constant and does not force a reload). On arm64 the generated assembly looks like:
		ldr	x4, [x0,#8]
	loop:
		ldr	x1, [x0,#8]
		cmp	x1, x4
		b.ne	loop
If the thread is preempted between the load of c->tid (into x1) and tid (into x4), and an allocation or free occurs in another thread (bumping the cpu_slab's tid), the thread will be stuck in the loop until s->cpu_slab->tid wraps, which may be forever in the absence of allocations/frees on the same CPU. This patch changes the loop condition to access c->tid with READ_ONCE. This ensures that the value is reloaded even when the compiler would otherwise assume it could cache the value, and also ensures that the load will not be torn. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
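The shape of the fixed loop can be sketched in userspace. Everything named `toy_`/`TOY_` here is hypothetical: `TOY_READ_ONCE` is a volatile-cast stand-in for the kernel's READ_ONCE (it relies on the GCC/Clang `__typeof__` extension), and the single-threaded model cannot reproduce the hang itself; it only shows that each loop iteration performs a fresh load of the tid.

```c
#include <assert.h>

/* Force a fresh load from memory on every evaluation (GCC/Clang __typeof__). */
#define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the per-cpu slab state. */
struct toy_cpu_slab {
	unsigned long tid;	/* bumped on every alloc/free on this cpu */
	void *freelist;
};

struct toy_snapshot {
	unsigned long tid;
	struct toy_cpu_slab *c;
};

/*
 * Take a (tid, cpu_slab) pair and retry until both agree. The volatile
 * read in the loop condition keeps the compiler from hoisting the load
 * of c->tid out of the loop, which was the failure mode described above.
 */
static struct toy_snapshot toy_snapshot_cpu_slab(struct toy_cpu_slab *slab)
{
	struct toy_snapshot snap;

	do {
		snap.tid = TOY_READ_ONCE(slab->tid);
		snap.c = slab;
	} while (snap.tid != TOY_READ_ONCE(snap.c->tid));

	return snap;
}
```

In the kernel the two values come from separate per-cpu accessors, so a mismatch is possible when the task is preempted in between; here both reads hit the same object and the loop exits after one pass.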
|
#
0316bec2 |
|
13-Feb-2015 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm: slub: add kernel address sanitizer support for slub allocator With this patch kasan will be able to catch bugs in memory allocated by slub. Initially, all objects in a newly allocated slab page are marked as redzone. Later, when a slub object is allocated, the number of bytes requested by the caller is marked as accessible, and the rest of the object (including slub's metadata) is marked as redzone (inaccessible). We also mark an object as accessible if ksize was called for it. There are some places in the kernel where the ksize function is called to inquire the size of the really allocated area. Such callers could validly access the whole allocated memory, so it should be marked as accessible. Code in the slub.c and slab_common.c files could validly access an object's metadata, so instrumentation for these files is disabled. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Signed-off-by: Dmitry Chernenkov <dmitryc@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a79316c6 |
|
13-Feb-2015 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm: slub: introduce metadata_access_enable()/metadata_access_disable() It's ok for slub to access memory that is marked by kasan as inaccessible (object's metadata). Kasan shouldn't print a report in that case because these accesses are valid. Disabling instrumentation of slub.c code is not enough to achieve this because slub passes a pointer to the object's metadata into external functions like memchr_inv(). We don't want to disable instrumentation for memchr_inv() because this is quite a generic function, and we don't want to miss bugs. metadata_access_enable/metadata_access_disable are used to tell KASan where accesses to metadata start/end, so we can temporarily disable KASan reports. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
75c66def |
|
13-Feb-2015 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm: slub: share object_err function Remove static and add function declarations to linux/slub_def.h so it could be used by kernel address sanitizer. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5024c1d7 |
|
13-Feb-2015 |
Tejun Heo <tj@kernel.org> |
slub: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. * This is an equivalent conversion but the whole function should be converted to use the scnprintf family of functions rather than performing custom output length predictions in multiple places. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d6e0b7fa |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: make dead caches discard free slabs immediately To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents the destruction of per-memcg caches, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects the approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was well received and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because a SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically.
Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce3712d7 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: fix kmem_cache_shrink return value It is supposed to return 0 if the cache has no remaining objects and 1 otherwise, while currently it always returns 0. Fix it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
832f37f5 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: never fail to shrink cache SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but also tries to rearrange the partial lists to place slabs filled up most to the head to cope with fragmentation. To achieve that, it allocates a temporary array of lists used to sort slabs by the number of objects in use. If the allocation fails, the whole procedure is aborted. This is unacceptable for the kernel memory accounting extension of the memory cgroup, where we want to make sure that kmem_cache_shrink() successfully discarded empty slabs. Although the allocation failure is utterly unlikely with the current page allocator implementation, which retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not to rely on that. This patch therefore makes __kmem_cache_shrink() allocate the array on stack instead of calling kmalloc, which may fail. The array size is chosen to be equal to 32, because most SLUB caches store not more than 32 objects per slab page. Slab pages with <= 32 free objects are sorted using the array by the number of objects in use and promoted to the head of the partial list, while slab pages with > 32 free objects are left in the end of the list without any ordering imposed on them. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
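The on-stack bucket approach described above can be sketched as follows. This is a simplified userspace model with hypothetical names (`toy_slab`, `toy_shrink()`): it uses a singly linked list instead of the kernel's list_head, and it keeps the over-limit slabs in one extra bucket so they end up at the tail in their original order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SHRINK_PROMOTE_MAX 32

/* Hypothetical partial-list entry. */
struct toy_slab {
	unsigned objects;	/* capacity of the slab */
	unsigned inuse;		/* objects currently allocated */
	struct toy_slab *next;
};

static void toy_append(struct toy_slab **head, struct toy_slab **tail,
		       struct toy_slab *s)
{
	s->next = NULL;
	if (*tail)
		(*tail)->next = s;
	else
		*head = s;
	*tail = s;
}

/*
 * Reorder a partial list without allocating: empty slabs are dropped,
 * slabs with 1..32 free objects are bucketed by free count (fullest
 * first), and slabs with more free objects go to the tail unsorted.
 * Returns the new list head; *discarded counts the dropped empty slabs.
 */
static struct toy_slab *toy_shrink(struct toy_slab *list, unsigned *discarded)
{
	/* Fixed-size arrays on the stack, so this step cannot fail. */
	struct toy_slab *head[TOY_SHRINK_PROMOTE_MAX + 1] = { NULL };
	struct toy_slab *tail[TOY_SHRINK_PROMOTE_MAX + 1] = { NULL };
	struct toy_slab *out_head = NULL, *out_tail = NULL;

	*discarded = 0;
	while (list) {
		struct toy_slab *s = list;
		unsigned nr_free = s->objects - s->inuse;

		list = s->next;
		assert(nr_free > 0);	/* full slabs are not on a partial list */
		if (nr_free == s->objects) {
			(*discarded)++;
			s->next = NULL;
			continue;
		}
		/* The last bucket collects everything above the limit. */
		unsigned b = nr_free <= TOY_SHRINK_PROMOTE_MAX ?
			     nr_free - 1 : TOY_SHRINK_PROMOTE_MAX;
		toy_append(&head[b], &tail[b], s);
	}

	/* Splice the buckets back together, fewest free objects first. */
	for (unsigned b = 0; b <= TOY_SHRINK_PROMOTE_MAX; b++) {
		if (!head[b])
			continue;
		if (out_tail)
			out_tail->next = head[b];
		else
			out_head = head[b];
		out_tail = tail[b];
	}
	return out_head;
}
```

Because the bucket arrays have a fixed size and live on the stack, the reordering step has no allocation that could fail, which is the property the patch is after.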
|
#
426589f5 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slab: link memcg caches of the same kind into a list Sometimes, we need to iterate over all memcg copies of a particular root kmem cache. Currently, we use memcg_cache_params->memcg_caches array for that, because it contains all existing memcg caches. However, it's a bad practice to keep all caches, including those that belong to offline cgroups, in this array, because it will be growing beyond any bounds then. I'm going to wipe away dead caches from it to save space. To still be able to perform iterations over all memcg caches of the same kind, let us link them into a list. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f7ce3190 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slab: embed memcg_cache_params to kmem_cache Currently, kmem_cache stores a pointer to struct memcg_cache_params instead of embedding it. The rationale is to save memory when kmem accounting is disabled. However, the memcg_cache_params has shrivelled drastically since it was first introduced:
* Initially:
	struct memcg_cache_params {
		bool is_root_cache;
		union {
			struct kmem_cache *memcg_caches[0];
			struct {
				struct mem_cgroup *memcg;
				struct list_head list;
				struct kmem_cache *root_cache;
				bool dead;
				atomic_t nr_pages;
				struct work_struct destroy;
			};
		};
	};
* Now:
	struct memcg_cache_params {
		bool is_root_cache;
		union {
			struct {
				struct rcu_head rcu_head;
				struct kmem_cache *memcg_caches[0];
			};
			struct {
				struct mem_cgroup *memcg;
				struct kmem_cache *root_cache;
			};
		};
	};
So the memory saving does not seem to be a clear win anymore. OTOH, keeping a pointer to memcg_cache_params struct instead of embedding it results in touching one more cache line on kmem alloc/free hot paths. Besides, it makes linking kmem caches in a list chained by a field of struct memcg_cache_params really painful due to a level of indirection, while I want to make them linked in the following patch. That said, let us embed it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
94e4d712 |
|
10-Feb-2015 |
Kim Phillips <kim.phillips@freescale.com> |
mm/slub.c: fix typo in comment Signed-off-by: Kim Phillips <kim.phillips@freescale.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9aabf810 |
|
10-Feb-2015 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm/slub: optimize alloc/free fastpath by removing preemption on/off

We had to insert a preempt enable/disable in the fastpath a while ago in order to guarantee that tid and kmem_cache_cpu are retrieved on the same cpu. This is a problem only for CONFIG_PREEMPT, in which the scheduler can move the process to another cpu while the data is being retrieved.

Now I have found a way to remove the preempt enable/disable from the fastpath. If tid matches kmem_cache_cpu's tid after tid and kmem_cache_cpu are retrieved by separate this_cpu operations, it means that they were retrieved on the same cpu. If they do not match, we just have to retry.

With this guarantee, preemption enable/disable isn't needed at all even with CONFIG_PREEMPT, so this patch removes it.

I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

Below is the result of Christoph's slab_test reported by Jesper Dangaard Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worsens performance in !CONFIG_PREEMPT, by roughly 0.3%. Implementing each case separately would help performance, but since the difference is so marginal, I didn't do that. Keeping a single implementation also helps maintenance, since we have the same code for all cases.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
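
The tid comparison described in this message can be modeled in user space. The following is a minimal sketch under stated assumptions, not the kernel code: `struct cpu_slab`, `init_tids()`, `bump_tid()` and `snapshot_is_consistent()` are hypothetical names, the two parameters stand in for "the cpu the task was running on" at each of the two separate this_cpu reads, and the tid encoding (cpu number in the low bits, advancing by a power-of-two step) is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  4	/* hypothetical cpu count for the model */
#define TID_STEP 4	/* power of two >= NR_CPUS (model assumption) */

/* Hypothetical stand-in for the per-cpu allocator state. */
struct cpu_slab {
	unsigned long tid;	/* transaction id, distinct on every cpu */
};

static struct cpu_slab cpu_slabs[NR_CPUS];

/* Give every cpu a distinct tid: the cpu number lives in the low bits. */
static void init_tids(void)
{
	for (unsigned long cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_slabs[cpu].tid = cpu;
}

/* Advance a cpu's tid after an operation; the low bits are preserved. */
static void bump_tid(int cpu)
{
	cpu_slabs[cpu].tid += TID_STEP;
}

/*
 * Model one fastpath attempt: the tid is read while running on
 * cpu_at_tid_read, the per-cpu pointer while running on cpu_at_ptr_read.
 * The snapshot is usable only if both reads saw the same cpu, which the
 * tid comparison detects; otherwise the caller would retry.
 */
static bool snapshot_is_consistent(int cpu_at_tid_read, int cpu_at_ptr_read)
{
	unsigned long tid = cpu_slabs[cpu_at_tid_read].tid;
	struct cpu_slab *c = &cpu_slabs[cpu_at_ptr_read];

	return tid == c->tid;
}
```

Because tids never collide across cpus in this model, a migration between the two reads always shows up as a mismatch, which is why no preemption guard is needed around them.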
|
#
dee2f8aa |
|
12-Dec-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: fix cpuset check in get_any_partial If we fail to allocate from the current node's stock, we look for free objects on other nodes before calling the page allocator (see get_any_partial). While checking other nodes we respect cpuset constraints by calling cpuset_zone_allowed. We enforce the hardwall check. As a result, we will fall back to the page allocator even if there are some pages cached on other nodes, but the current cpuset doesn't have those nodes set. However, the page allocator uses the softwall check for kernel allocations, so it may allocate from one of the other nodes in this case. Therefore we should use the softwall cpuset check in get_any_partial to conform with the cpuset check in the page allocator. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8135be5a |
|
12-Dec-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
memcg: fix possible use-after-free in memcg_kmem_get_cache()

Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed:

 CPU0                                   CPU1
 ----                                   ----
 [ current=@t
   @mc->memcg_params->nr_pages=0 ]

 kmem_cache_alloc(@c):
   call memcg_kmem_get_cache(@c);
   proceed to allocation from @mc:
     alloc a page for @mc:
       ...
                                        move @t from @memcg
                                        destroy @memcg:
                                          mem_cgroup_css_offline(@memcg):
                                            memcg_unregister_all_caches(@memcg):
                                              kmem_cache_destroy(@mc)

     add page to @mc

We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome.

Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, at the beginning of kmem_cache_alloc, to be dropped at the end, and move per-memcg cache destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound like a high price for code readability though.

Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c871ac4e |
|
10-Dec-2014 |
Andrew Morton <akpm@linux-foundation.org> |
slab: improve checking for invalid gfp_flags The code goes BUG, but doesn't tell us which bits were unexpectedly set. Print that out. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
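
The idea of reporting the offending bits before failing can be sketched as follows. This is an illustrative user-space model only: the `DEMO_*` flag values, the `DEMO_SLAB_BUG_MASK` name and `check_gfp_flags()` are hypothetical and do not reflect the kernel's real bit layout or function names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flag values for the model; not the kernel's bit layout. */
#define DEMO_GFP_WAIT     0x01u
#define DEMO_GFP_ZERO     0x02u
#define DEMO_GFP_DMA32    0x04u
#define DEMO_GFP_HIGHMEM  0x08u

/* Bits that a slab allocation must never carry in this model. */
#define DEMO_SLAB_BUG_MASK (DEMO_GFP_DMA32 | DEMO_GFP_HIGHMEM)

/*
 * Return the unexpected bits (0 when the flags are fine) and report
 * them, instead of failing without saying which bits were set.
 */
static unsigned int check_gfp_flags(unsigned int flags)
{
	unsigned int invalid = flags & DEMO_SLAB_BUG_MASK;

	if (invalid)
		fprintf(stderr, "unexpected gfp bits: 0x%x\n", invalid);
	return invalid;
}
```

Masking first and printing only the intersection keeps the report short: the caller sees exactly the bits that triggered the failure rather than the whole flag word.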
|
#
f6edde9c |
|
10-Dec-2014 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm: slub: fix format mismatches in slab_err() callers

Adding __printf(3, 4) to slab_err exposed the following:

 mm/slub.c: In function `check_slab':
 mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
     s->name, page->objects, maxobj);
     ^
 mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
 mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
     s->name, page->inuse, page->objects);
     ^
 mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
 mm/slub.c: In function `on_freelist':
 mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
     "should be %d", page->objects, max_objects);

Fix the first two warnings by removing the redundant s->name. Fix the last by changing the type of max_objects from unsigned long to int.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
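
How a printf-style format annotation lets the compiler catch such mismatches can be shown with a small user-space sketch. It assumes GCC or Clang, which provide `__attribute__((format(printf, ...)))`; `report_err()` and its buffer parameters are hypothetical and only mirror the "format string is argument 3, variadic arguments start at 4" shape mentioned in the message.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical reporter. The attribute tells the compiler that argument 3
 * is a printf format string and that the values to check start at
 * argument 4, so mismatched callers produce -Wformat warnings.
 */
__attribute__((format(printf, 3, 4)))
static int report_err(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(buf, len, fmt, args);	/* format into caller's buffer */
	va_end(args);
	return n;
}
```

With the attribute in place, a call such as `report_err(buf, sizeof(buf), "objects %u > max %u", "name", 5u, 3u)` would be flagged at compile time, in the same way as the warnings quoted above.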
|
#
b455def2 |
|
10-Dec-2014 |
LQYMGT <lqymgt@gmail.com> |
mm: slab/slub: coding style: whitespaces and tabs mixture Some code in mm/slab.c and mm/slub.c use whitespaces in indent. Clean them up. Signed-off-by: LQYMGT <lqymgt@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
344736f2 |
|
20-Oct-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 ("cpuset: rework cpuset_zone_allowed api"). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
|
#
423c929c |
|
09-Oct-2014 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm/slab_common: commonize slab merge logic Slab merge is a good feature to reduce fragmentation. Currently it is only applied to SLUB, but it would be good to apply it to SLAB as well. This patch is a preparation step for applying slab merge to SLAB by commonizing the slab merge logic. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a561ce00 |
|
09-Oct-2014 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: fall back to node_to_mem_node() node if allocating on memoryless node Update the SLUB code to search for partial slabs on the nearest node with memory in the presence of memoryless nodes. Additionally, do not consider it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a memoryless-node specified allocation goes off-node. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c9e16131 |
|
09-Oct-2014 |
Christoph Lameter <cl@linux.com> |
slub: disable tracing and failslab for merged slabs Tracing of mergeable slabs as well as uses of failslab are confusing since the objects of multiple slab caches will be affected. Moreover this creates a situation where a mergeable slab will become unmergeable. If tracing or failslab testing is desired then it may be best to switch merging off for starters. Signed-off-by: Christoph Lameter <cl@linux.com> Tested-by: WANG Chao <chaowang@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aee52cae |
|
06-Aug-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: remove kmemcg id from create_unique_id This function is never called for memcg caches, because they are unmergeable, so remove the dead code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4307c14f |
|
06-Aug-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
slab: fix the alias count (via sysfs) of slab cache We mark some slab caches (e.g. kmem_cache_node) as unmergeable by setting refcount to -1, and their alias should be 0, not refcount-1, so correct it here. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
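
The rule stated in this message is small enough to write down directly. The sketch below is a user-space illustration; `alias_count()` is a hypothetical helper name, not the sysfs handler itself.

```c
#include <assert.h>

/* Hypothetical helper applying the rule from the message above. */
static int alias_count(int refcount)
{
	/* A negative refcount marks an unmergeable cache: no aliases. */
	if (refcount < 0)
		return 0;
	/* Otherwise every reference beyond the first one is an alias. */
	return refcount - 1;
}
```

Without the negative check, an unmergeable cache with refcount -1 would be reported as having -2 aliases.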
|
#
0aa9a13d |
|
06-Aug-2014 |
Dan Carpenter <dan.carpenter@oracle.com> |
mm, slub: fix some indenting in cmpxchg_double_slab() The return statement goes with the cmpxchg_double() condition so it needs to be indented another tab. Also these days the fashion is to line function parameters up, and it looks nicer that way because then the "freelist_new" is not at the same indent level as the "return 1;". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
54266640 |
|
06-Aug-2014 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
slub: avoid duplicate creation on the first object When a kmem_cache is created with a ctor, each object in the kmem_cache is initialized before it is ready to use. In the slub implementation, however, the first object is initialized twice. This patch removes the duplicate initialization of the first object. Fix commit 7656c72b ("SLUB: add macros for scanning objects in a slab"). Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02e72cc6 |
|
06-Aug-2014 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y There are two versions of alloc/free hooks now - one for CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n. I see no reason why calls to other debugging subsystems (LOCKDEP, DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG. All these features should work regardless of the SLUB_DEBUG config, as all of them already have their own Kconfig options. This also fixes failslab for the CONFIG_SLUB_DEBUG=n configuration. It simply has not worked before because the should_failslab() call was in a hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else". Note: There is one concealed change in the allocation path for SLUB_DEBUG=n with all other debugging features disabled. The might_sleep_if() call can generate some code even if DEBUG_ATOMIC_SLEEP=n. For PREEMPT_VOLUNTARY=y might_sleep() inserts a _cond_resched() call, but I think it should be ok. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c07b8183 |
|
06-Aug-2014 |
David Rientjes <rientjes@google.com> |
mm, slub: mark resiliency_test as init text resiliency_test() is only called for bootstrap, so it may be moved to init.text and freed after boot. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fa45dc25 |
|
06-Aug-2014 |
Christoph Lameter <cl@linux.com> |
slub: use new node functions Make use of the new node functions in mm/slab.h to reduce code size and simplify. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
44c5356f |
|
06-Aug-2014 |
Christoph Lameter <cl@linux.com> |
slab common: add functions for kmem_cache_node access The patchset provides two new functions in mm/slab.h and modifies SLAB and SLUB to use these. The kmem_cache_node structure is shared between both allocators and the use of common accessors will allow us to move more code into slab_common.c in the future. This patch (of 3): These functions allow to eliminate repeatedly used code in both SLAB and SLUB and also allow for the insertion of debugging code that may be needed in the development process. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8a5b20ae |
|
02-Jul-2014 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: fix off by one in number of slab tests min_partial means the minimum number of slabs cached in the node partial list. So, if nr_partial is less than it, we keep a newly empty slab on the node partial list rather than freeing it. But if nr_partial is equal to or greater than it, we have enough partial slabs and should free the newly empty slab. The current implementation missed the equal case, so if min_partial is set to 0, at least one slab could still be cached. This is a critical problem for the kmemcg destroying logic because it doesn't work properly if some slabs are cached. This patch fixes this problem. Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs immediately"). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
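
The boundary condition described in this message can be captured in a one-line predicate. This is a sketch for illustration; `should_free_empty_slab()` is a hypothetical name, and the parameters simply model the per-node partial count and the cache's min_partial setting.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a newly empty slab should be freed rather than kept on
 * the node partial list. The equal case must count as "enough": with
 * min_partial == 0 nothing may stay cached.
 */
static bool should_free_empty_slab(unsigned long nr_partial,
				   unsigned long min_partial)
{
	return nr_partial >= min_partial;
}
```

Using `>` instead of `>=` here would keep one empty slab cached even when min_partial is 0, which is the off-by-one the message describes.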
|
#
844e4d66 |
|
06-Jun-2014 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: search partial list on numa_mem_id(), instead of numa_node_id() Currently, if the allocation's node constraint is NUMA_NO_NODE, we search for a partial slab on the numa_node_id() node. This doesn't work properly on a system having memoryless nodes, since such a node has no memory and therefore cannot have any partial slab. On that node, page allocation always falls back to numa_mem_id() first. So searching for a partial slab on numa_mem_id() in that case is the proper solution for the memoryless node case. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7c8e0181 |
|
04-Jun-2014 |
Christoph Lameter <cl@linux.com> |
mm: replace __get_cpu_var uses with this_cpu_ptr Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c67a8a68 |
|
04-Jun-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab Currently we have two pairs of kmemcg-related functions that are called on slab alloc/free. The first is memcg_{bind,release}_pages that count the total number of pages allocated on a kmem cache. The second is memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource counter. Let's just merge them to keep the code clean. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
03afc0e2 |
|
04-Jun-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slab: get_online_mems for kmem_cache_{create,destroy,shrink}

When we create a sl[au]b cache, we allocate kmem_cache_node structures for each online NUMA node. To handle nodes taken online/offline, we register a memory hotplug notifier and allocate/free the kmem_cache_node corresponding to the node that changes its state for each kmem cache.

To synchronize between the two paths we hold the slab_mutex during both the cache creation/destruction path and while tuning per-node parts of kmem caches in the memory hotplug handler, but that's not quite right, because it does not guarantee that a newly created cache will have all kmem_cache_nodes initialized in case it races with memory hotplug. For instance, in case of slub:

 CPU0                            CPU1
 ----                            ----
 kmem_cache_create:              online_pages:
  __kmem_cache_create:            slab_memory_callback:
                                   slab_mem_going_online_callback:
                                    lock slab_mutex
                                    for each slab_caches list entry
                                        allocate kmem_cache node
                                    unlock slab_mutex
   lock slab_mutex
   init_kmem_cache_nodes:
    for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
   add kmem_cache to slab_caches list
   unlock slab_mutex
                                 online_pages (continued):
                                  node_states_set_node

As a result we'll get a kmem cache with not all kmem_cache_nodes allocated.

To avoid issues like that we should hold get/put_online_mems() during the whole kmem cache creation/destruction/shrink paths, just like we deal with cpu hotplug. This patch does the trick.

Note that after it's applied, there is no need to take the slab_mutex for kmem_cache_shrink any more, so it is removed from there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bfc8c901 |
|
04-Jun-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
mem-hotplug: implement get/put_online_mems kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node online mask, because they init/destroy/access per-cpu/node kmem_cache parts, which can be allocated or destroyed on cpu/mem hotplug. To protect against cpu hotplug, these functions use {get,put}_online_cpus. However, they do nothing to synchronize with memory hotplug - taking the slab_mutex does not eliminate the possibility of race as described in patch 2. What we need there is something like get_online_cpus, but for memory. We already have lock_memory_hotplug, which serves the purpose, but it's a bit of a hammer right now, because it's backed by a mutex. As a result, it imposes some limitations on locking order, which are not desirable, and can't be used just like get_online_cpus. That's why in patch 1 I substitute it with get/put_online_mems, which work exactly like get/put_online_cpus except they block not cpu, but memory hotplug. [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it myself, because it used an rw semaphore for get/put_online_mems, making them deadlock prone. ] This patch (of 2): {un}lock_memory_hotplug, which is used to synchronize against memory hotplug, is currently backed by a mutex, which makes it a bit of a hammer - threads that only want to get a stable value of online nodes mask won't be able to proceed concurrently. Also, it imposes some strong locking ordering rules on it, which narrows down the set of its usage scenarios. This patch introduces get/put_online_mems, which are the same as get/put_online_cpus, but for memory hotplug, i.e. executing code inside a get/put_online_mems section will guarantee a stable value of online nodes, present pages, etc. lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether. 
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52383431 |
|
04-Jun-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
mm: get rid of __GFP_KMEMCG Currently to allocate a page that should be charged to kmemcg (e.g. threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page allocated is then to be freed by free_memcg_kmem_pages. Apart from looking asymmetrical, this also requires intrusion to the general allocation path. So let's introduce separate functions that will alloc/free pages charged to kmemcg. The new functions are called alloc_kmem_pages and free_kmem_pages. They should be used when the caller actually would like to use kmalloc, but has to fall back to the page allocator for the allocation is large. They only differ from alloc_pages and free_pages in that besides allocating or freeing pages they also charge them to the kmem resource counter of the current memory cgroup. [sfr@canb.auug.org.au: export kmalloc_order() to modules] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5dfb4175 |
|
04-Jun-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
sl[au]b: charge slabs to kmemcg explicitly We have only a few places where we actually want to charge kmem so instead of intruding into the general page allocation path with __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem charges will be easier to follow that way. This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG from memcg caches' allocflags. Instead it makes the slab allocation path call memcg_charge_kmem directly, getting the memcg to charge from the cache's memcg params. This also eliminates any possibility of misaccounting an allocation going from one memcg's cache to another memcg, because now we always charge slabs against the memcg the cache belongs to. That's why this patch removes the big comment to memcg_kmem_get_cache. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8eae1492 |
|
04-Jun-2014 |
Dave Hansen <dave.hansen@linux.intel.com> |
mm: slub: fix ALLOC_SLOWPATH stat

There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH got bumped in that exit path. Now there are two, and a bunch of gotos. ALLOC_SLOWPATH can now get set more than once during a single call to __slab_alloc() which is pretty bogus. Here's the sequence:

1. Enter __slab_alloc(), fall through all the way to the stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo (goto #2)
4. Fall through in the same path we did before all the way to stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return

Doing this is obviously bogus. It keeps us from being able to accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means that the total number of allocs always exceeds the total number of frees.

This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same place that __slab_alloc() is. This makes it much less likely that ALLOC_SLOWPATH will get botched again in the spaghetti-code inside __slab_alloc().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
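
The effect of moving the counter update to the call site can be illustrated with a small model. This is a sketch under stated assumptions: `alloc_slowpath_stat`, `slow_alloc_body()` and `slow_alloc()` are hypothetical names, and the `retries` parameter merely simulates the body looping back several times.

```c
#include <assert.h>

static unsigned long alloc_slowpath_stat;	/* hypothetical counter */

/* Hypothetical slow path body: may loop back ("goto redo") several times. */
static int slow_alloc_body(int retries)
{
	int passes = 0;

	do {
		passes++;	/* deliberately no stat update in here */
	} while (retries-- > 0);
	return passes;
}

/* Bump the stat once at the call site, however often the body loops. */
static int slow_alloc(int retries)
{
	alloc_slowpath_stat++;
	return slow_alloc_body(retries);
}
```

Because the counter is updated outside the looping body, one slow-path call contributes exactly one increment, which keeps the slow-path and fast-path counts comparable.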
|
#
9a02d699 |
|
04-Jun-2014 |
David Rientjes <rientjes@google.com> |
mm, slab: suppress out of memory warning unless debug is enabled When the slab or slub allocators cannot allocate additional slab pages, they emit diagnostic information to the kernel log such as current number of slabs, number of objects, active objects, etc. This is always coupled with a page allocation failure warning since it is controlled by !__GFP_NOWARN. Suppress this out of memory warning if the allocator is configured without debug support. The page allocation failure warning will indicate it is a failed slab allocation, the order, and the gfp mask, so this is only useful to diagnose allocator issues. Since CONFIG_SLUB_DEBUG is already enabled by default for the slub allocator, there is no functional change with this patch. If debug is disabled, however, the warnings are now suppressed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ecc42fbe |
|
04-Jun-2014 |
Fabian Frederick <fabf@skynet.be> |
mm/slub.c: convert vnsprintf-static to va_format Inspired by Joe Perches' suggestion in the ntfs logging clean-up. Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f9f58285 |
|
04-Jun-2014 |
Fabian Frederick <fabf@skynet.be> |
mm/slub.c: convert printk to pr_foo() All printk(KERN_foo converted to pr_foo() Default printk converted to pr_warn() Coalesce format fragments Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41a21285 |
|
06-May-2014 |
Christoph Lameter <cl@linux.com> |
slub: use sysfs'es release mechanism for kmem_cache debugobjects warning during netfilter exit: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984 Workqueue: netns cleanup_net Call Trace: dump_stack+0x52/0x87 warn_slowpath_common+0x8c/0xc0 warn_slowpath_fmt+0x46/0x50 debug_print_object+0x8d/0xb0 __debug_check_no_obj_freed+0xa5/0x220 debug_check_no_obj_freed+0x15/0x20 kmem_cache_free+0x197/0x340 kmem_cache_destroy+0x86/0xe0 nf_conntrack_cleanup_net_list+0x131/0x170 nf_conntrack_pernet_exit+0x5d/0x70 ops_exit_list+0x5e/0x70 cleanup_net+0xfb/0x1c0 process_one_work+0x338/0x550 worker_thread+0x215/0x350 kthread+0xe7/0xf0 ret_from_fork+0x7c/0xb0 Also during dcookie cleanup: WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408 Call Trace: dump_stack (lib/dump_stack.c:52) warn_slowpath_common (kernel/panic.c:430) warn_slowpath_fmt (kernel/panic.c:445) debug_print_object (lib/debugobjects.c:262) __debug_check_no_obj_freed (lib/debugobjects.c:697) debug_check_no_obj_freed (lib/debugobjects.c:726) kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717) kmem_cache_destroy (mm/slab_common.c:363) dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343) event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153) __fput (fs/file_table.c:217) ____fput (fs/file_table.c:253) task_work_run (kernel/task_work.c:125 (discriminator 1)) do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751) int_signal (arch/x86/kernel/entry_64.S:807) Sysfs 
has a release mechanism. Use that to release the kmem_cache structure if CONFIG_SYSFS is enabled. Only slub is changed - slab currently only supports /proc/slabinfo and not /sys/kernel/slab/*. We talked about adding that and someone was working on it. [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build] [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more] Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Greg KH <greg@kroah.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
93030d83 |
|
06-May-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: fix memcg_propagate_slab_attrs After creating a cache for a memcg we should initialize its sysfs attrs with the values from its parent. That's what memcg_propagate_slab_attrs is for. Currently it's broken - we clearly muddled root-vs-memcg caches there. Let's fix it up. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
88da03a6 |
|
07-Apr-2014 |
Christoph Lameter <cl@linux.com> |
slub: use raw_cpu_inc for incrementing statistics Statistics are not critical to the operation of the allocation but should also not cause too much overhead. When __this_cpu_inc is altered to check if preemption is disabled this triggers. Use raw_cpu_inc to avoid the checks. Using this_cpu_ops may cause interrupt disable/enable sequences on various arches which may significantly impact allocator performance. [akpm@linux-foundation.org: add comment] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
54b6a731 |
|
07-Apr-2014 |
Dave Jones <davej@fedoraproject.org> |
slub: fix leak of 'name' in sysfs_slab_add The failure paths of sysfs_slab_add don't release the allocation of 'name' made by create_unique_id() a few lines above the context of the diff below. Create a common exit path to make it more obvious what needs freeing. [vdavydov@parallels.com: free the name only if !unmergeable] Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
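The "common exit path" idea can be sketched in plain user-space C. This is an illustration under assumptions, not the actual sysfs_slab_add(): `make_unique_name()` stands in for create_unique_id(), `add_entry()` and its `fail_step` parameter are hypothetical, and `live_names` exists only so the sketch can show that nothing leaks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int live_names; /* outstanding name allocations, for illustration */

/* Hypothetical stand-in for create_unique_id(): heap-allocated name. */
static char *make_unique_name(const char *base)
{
	size_t len = strlen(base) + 2; /* ':' + base + NUL */
	char *name = malloc(len);

	if (!name)
		return NULL;
	name[0] = ':';
	memcpy(name + 1, base, len - 1); /* copies base and its NUL */
	live_names++;
	return name;
}

/*
 * Every failure jumps to 'out', so the name is released in exactly one
 * place, and only when it was allocated here (the !unmergeable case).
 */
static int add_entry(const char *base, int unmergeable, int fail_step)
{
	const char *name;
	char *owned = NULL;
	int err = 0;

	if (unmergeable) {
		name = base; /* borrowed from the caller, never freed here */
	} else {
		owned = make_unique_name(base);
		if (!owned)
			return -1;
		name = owned;
	}

	if (strlen(name) == 0 || fail_step == 1) { /* e.g. registration failed */
		err = -1;
		goto out;
	}
	if (fail_step == 2) { /* e.g. attribute creation failed */
		err = -2;
		goto out;
	}

out:
	if (owned) {
		free(owned);
		live_names--;
	}
	return err;
}
```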
|
#
9a41707b |
|
07-Apr-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: rework sysfs layout for memcg caches Currently, we try to arrange sysfs entries for memcg caches in the same manner as for global caches. Apart from turning /sys/kernel/slab into a mess when there are a lot of kmem-active memcgs created, it actually does not work properly - we won't create more than one link to a memcg cache in case its parent is merged with another cache. For instance, if A is a root cache merged with another root cache B, we will have the following sysfs setup: X A -> X B -> X where X is some unique id (see create_unique_id()). Now if memcgs M and N start to allocate from cache A (or B, which is the same), we will get: X X:M X:N A -> X B -> X A:M -> X:M A:N -> X:N Since B is an alias for A, we won't get entries B:M and B:N, which is confusing. It is more logical to have entries for memcg caches under the corresponding root cache's sysfs directory. This would allow us to keep sysfs layout clean, and avoid such inconsistencies like one described above. This patch does the trick. It creates a "cgroup" kset in each root cache kobject to keep its children caches there. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
84d0ddd6 |
|
07-Apr-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: adjust memcg caches when creating cache alias Otherwise, kzalloc() called from a memcg won't clear the whole object. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a44cb944 |
|
07-Apr-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
memcg, slab: never try to merge memcg caches When a kmem cache is created (kmem_cache_create_memcg()), we first try to find a compatible cache that already exists and can handle requests from the new cache, i.e. has the same object size, alignment, ctor, etc. If there is such a cache, we do not create any new caches, instead we simply increment the refcount of the cache found and return it. Currently we do this procedure not only when creating root caches, but also for memcg caches. However, there is no point in that, because, as every memcg cache has exactly the same parameters as its parent and cache merging cannot be turned off at runtime (only on boot by passing "slub_nomerge"), the root caches of any two potentially mergeable memcg caches should be merged already, i.e. it must be the same root cache, and therefore we couldn't even get to the memcg cache creation, because it already exists. The only exception is boot caches - they are explicitly forbidden to be merged by setting their refcount to -1. There are currently only two of them - kmem_cache and kmem_cache_node, which are used in slab internals (I do not count kmalloc caches as their refcount is set to 1 immediately after creation). Since they are prevented from merging to begin with, I guess we should avoid merging their children too. So let's remove the useless code responsible for merging memcg caches. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2a389610 |
|
07-Apr-2014 |
David Rientjes <rientjes@google.com> |
mm, mempolicy: rename slab_node for clarity slab_node() is actually a mempolicy function, so rename it to mempolicy_slab_node() to make it clearer that it used for processes with mempolicies. At the same time, cleanup its code by saving numa_mem_id() in a local variable (since we require a node with memory, not just any node) and remove an obsolete comment that assumes the mempolicy is actually passed into the function. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Tim Hockin <thockin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
421af243 |
|
03-Apr-2014 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
slub: do not drop slab_mutex for sysfs_slab_add We release the slab_mutex while calling sysfs_slab_add from __kmem_cache_create since commit 66c4c35c6bc5 ("slub: Do not hold slub_lock when calling sysfs_slab_add()"), because kobject_uevent called by sysfs_slab_add might block waiting for the usermode helper to exec, which would result in a deadlock if we took the slab_mutex while executing it. However, apart from complicating synchronization rules, releasing the slab_mutex on kmem cache creation can result in a kmemcg-related race. The point is that we check if the memcg cache exists before going to __kmem_cache_create, but register the new cache in memcg subsys after it. Since we can drop the mutex there, several threads can see that the memcg cache does not exist and proceed to creating it, which is wrong. Fortunately, recently kobject_uevent was patched to call the usermode helper with the UMH_NO_WAIT flag, making the deadlock impossible. Therefore there is no point in releasing the slab_mutex while calling sysfs_slab_add, so let's simplify kmem_cache_create synchronization and fix the kmemcg-race mentioned above by holding the slab_mutex during the whole cache creation path. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Greg KH <greg@kroah.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d26914d1 |
|
03-Apr-2014 |
Mel Gorman <mgorman@suse.de> |
mm: optimize put_mems_allowed() usage Since put_mems_allowed() is strictly optional (it's a seqcount retry), we don't need to evaluate the function if the allocation was in fact successful, saving an smp_rmb, some loads and comparisons on some relatively fast paths. Since the naming, get/put_mems_allowed(), does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
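The begin/retry shape can be sketched as a single-threaded user-space model. Everything here is an assumption made for illustration: `read_begin()` and `read_retry()` only mimic the naming of the new interface, `mems_seq` is a plain counter without the memory barriers a real seqcount needs, and `try_alloc()` simulates a concurrent update on its first call. Note the two properties the message stresses: the retry check is skipped when the allocation succeeded, and it returns true when the caller should try again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static unsigned int mems_seq; /* hypothetical sequence counter */
static int attempts;

/* Sample the sequence before attempting the allocation. */
static unsigned int read_begin(void)
{
	return mems_seq;
}

/* True means the state changed underneath us and a retry is warranted. */
static bool read_retry(unsigned int seq)
{
	return mems_seq != seq;
}

/* Simulated allocator: the first attempt races with an update and fails. */
static void *try_alloc(void)
{
	static int object;

	attempts++;
	if (attempts == 1) {
		mems_seq++; /* pretend the allowed node set just changed */
		return NULL;
	}
	return &object;
}

static void *alloc_with_retry(void)
{
	unsigned int cookie;
	void *obj;

	do {
		cookie = read_begin();
		obj = try_alloc();
		/* On success the retry check is never evaluated. */
	} while (!obj && read_retry(cookie));

	return obj;
}
```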
|
#
80c3a998 |
|
12-Mar-2014 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: fix high order page allocation problem with __GFP_NOFAIL SLUB already tries to allocate high order pages with __GFP_NOFAIL cleared. But when allocating the shadow page for kmemcheck, it missed clearing the flag. This triggers the WARN_ON_ONCE() reported by Christian Casteyde. https://bugzilla.kernel.org/show_bug.cgi?id=65991 https://lkml.org/lkml/2013/12/3/764 This patch fixes the situation by using the same allocation flags as the original allocation. Reported-by: Christian Casteyde <casteyde.christian@free.fr> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
1e4dd946 |
|
10-Feb-2014 |
Steven Rostedt <rostedt@goodmis.org> |
slub: do not assert not having lock in removing freed partial Vladimir reported the following issue: Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires remove_partial() to be called with n->list_lock held, but free_partial() called from kmem_cache_close() on cache destruction does not follow this rule, leading to a warning: WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0() Modules linked in: CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1 Hardware name: 0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600 0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000 ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0 Call Trace: __kmem_cache_shutdown+0x1b2/0x1f0 kmem_cache_destroy+0x43/0xf0 xfs_destroy_zones+0x103/0x110 [xfs] exit_xfs_fs+0x38/0x4e4 [xfs] SyS_delete_module+0x19a/0x1f0 system_call_fastpath+0x16/0x1b His solution was to add a spinlock in order to quiet lockdep. Although there would be no contention to adding the lock, that lock also requires disabling of interrupts which will have a larger impact on the system. Instead of adding a spinlock to a location where it is not needed for lockdep, make a __remove_partial() function that does not test if the list_lock is held, as no one should have it due to it being freed. Also added a __add_partial() function that does not do the lock validation either, as it is not needed for the creation of the cache. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reported-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
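The split between a checked and an unchecked helper can be sketched as follows. This is a user-space model under assumptions: `struct node` and its `list_lock_held` flag stand in for the real node structure and lockdep state, `assert()` stands in for lockdep_assert_held(), and `remove_partial_unlocked()` mirrors the role of `__remove_partial()` (renamed here because identifiers starting with two underscores are reserved in ordinary C programs).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-node partial list. */
struct node {
	bool list_lock_held;      /* stand-in for a spinlock plus lockdep state */
	unsigned long nr_partial;
};

/*
 * Unchecked variant: for teardown paths where nothing else can reach
 * the node any more, so taking the lock would be pure overhead.
 */
static void remove_partial_unlocked(struct node *n)
{
	n->nr_partial--;
}

/* Checked variant: every normal path must hold the list lock. */
static void remove_partial(struct node *n)
{
	assert(n->list_lock_held); /* models lockdep_assert_held() */
	remove_partial_unlocked(n);
}
```

The design choice is that the assertion stays on the commonly used entry point, while the few callers that are known to be exclusive opt out explicitly by name.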
|
#
255d0884 |
|
10-Feb-2014 |
David Rientjes <rientjes@google.com> |
mm/slub.c: list_lock may not be held in some circumstances Commit c65c1877bd68 ("slub: use lockdep_assert_held") incorrectly required that add_full() and remove_full() hold n->list_lock. The lock is only taken when kmem_cache_debug(s), since that's the only time it actually does anything. Require that the lock only be taken under such a condition. Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
67b6c900 |
|
24-Jan-2014 |
Dave Hansen <dave.hansen@linux.intel.com> |
mm: slub: work around unneeded lockdep warning The slub code does some setup during early boot in early_kmem_cache_node_alloc() with some local data. There is no possible way that another CPU can see this data, so the slub code doesn't unnecessarily lock it. However, some new lockdep asserts check to make sure that add_partial() _always_ has the list_lock held. Just add the locking, even though it is technically unnecessary. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a0320865 |
|
30-Jan-2014 |
Dave Hansen <dave.hansen@linux.intel.com> |
mm/slub.c: fix page->_count corruption (again) Commit abca7c496584 ("mm: fix slab->page _count corruption when using slub") notes that we cannot _set_ a page->counters directly, except when using a real double-cmpxchg. Doing so can lose updates to ->_count. That is an absolute rule: You may not *set* page->counters except via a cmpxchg. Commit abca7c496584 fixed this for the folks who have the slub cmpxchg_double code turned off at compile time, but it left the bad case alone. It can still be reached, and the same bug triggered in two cases: 1. Turning on slub debugging at runtime, which is available on the distro kernels that I looked at. 2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64 cpus, evidently) There are at least 3 ways we could fix this: 1. Take all of the existing calls to cmpxchg_double_slab() and __cmpxchg_double_slab() and convert them to take an old, new and target 'struct page'. 2. Do (1), but with the newly-introduced 'slub_data'. 3. Do some magic inside the two cmpxchg...slab() functions to pull the counters out of new_counters and only set those fields in page->{inuse,frozen,objects}. I've done (2) as well, but it's a bunch more code. This patch is an attempt at (3). This was the most straightforward and foolproof way that I could think to do this. This would also technically allow us to get rid of the ugly #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) in 'struct page', but leaving it alone has the added benefit that 'counters' stays 'unsigned' instead of 'unsigned long', so all the copies that the slub code does stay a bit smaller. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
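Approach (3) can be sketched with a simplified model. The layout below is an assumption for illustration only: `struct fake_page` is not the kernel's `struct page`, `refcount` stands in for `_count`, and `set_slab_counters()` is a hypothetical helper name. What it shows is that unpacking the new value through a temporary and copying only the three slab fields leaves a concurrently updated reference count untouched, whereas assigning `counters` wholesale would overwrite it.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: 'counters' overlays the slab fields AND the refcount. */
struct fake_page {
	union {
		uint64_t counters;
		struct {
			unsigned int inuse:16;
			unsigned int objects:15;
			unsigned int frozen:1;
			int32_t refcount; /* stand-in for page->_count */
		};
	};
};

_Static_assert(sizeof(struct fake_page) == sizeof(uint64_t),
	       "counters must overlay all fields in this sketch");

/*
 * Unpack the new value via a temporary and copy only the slab fields.
 * A plain 'page->counters = counters_new' would also clobber refcount.
 */
static void set_slab_counters(struct fake_page *page, uint64_t counters_new)
{
	struct fake_page tmp;

	tmp.counters = counters_new;
	page->frozen  = tmp.frozen;
	page->inuse   = tmp.inuse;
	page->objects = tmp.objects;
}
```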
|
#
a0132ac0 |
|
29-Jan-2014 |
Dave Hansen <dave.hansen@linux.intel.com> |
mm/slub.c: do not VM_BUG_ON_PAGE() for temporary on-stack pages Commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls. But, most of the ones in the slub code are for _temporary_ 'struct page's which are declared on the stack and likely have lots of gunk in them. Dumping their contents out will just confuse folks looking at bad_page() output. Plus, if we try to page_to_pfn() on them or something, we'll probably oops anyway. Turn them back into VM_BUG_ON()s. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
309381fea |
|
23-Jan-2014 |
Sasha Levin <sasha.levin@oracle.com> |
mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
26e4f205 |
|
04-Jan-2014 |
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> |
slub: Fix possible format string bug. The "name" is determined at runtime and is parsed as format string. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pekka Enberg <penberg@kernel.org>
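The class of bug is easy to show in user-space C. The helper name `format_name()` is hypothetical and this is not the kernel call site; it only demonstrates that a runtime-determined string must be passed as an argument to a constant "%s" format, never as the format itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats a runtime-determined name safely. The unsafe form would be
 * snprintf(buf, len, name): any '%' sequence inside 'name' would then
 * be interpreted as a conversion specifier.
 */
static int format_name(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "%s", name);
}
```

With the constant format, a name such as "cache-%s-%n" is copied verbatim instead of being parsed.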
|
#
c65c1877 |
|
10-Jan-2014 |
Peter Zijlstra <peterz@infradead.org> |
slub: use lockdep_assert_held Instead of using comments in an attempt at getting the locking right, use proper assertions that actively warn you if you got it wrong. Also add extra braces in a few sites to comply with coding-style. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8afb1474 |
|
09-Sep-2013 |
Li Zefan <lizefan@huawei.com> |
slub: Fix calculation of cpu slabs /sys/kernel/slab/:t-0000048 # cat cpu_slabs 231 N0=16 N1=215 /sys/kernel/slab/:t-0000048 # cat slabs 145 N0=36 N1=109 See, the number of slabs is smaller than that of cpu slabs. The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700 ("slub: per cpu cache for partial pages"). We should use page->pages instead of page->pobjects when calculating the number of cpu partial slabs. This also fixes the mapping of slabs and nodes. As there's no variable storing the number of total/active objects in cpu partial slabs, and we don't have user interfaces requiring those statistics, I just add WARN_ON for those cases. Cc: <stable@vger.kernel.org> # 3.2+ Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
2ade4de8 |
|
12-Nov-2013 |
Qiang Huang <h.huangqiang@huawei.com> |
memcg, kmem: rename cache_from_memcg to cache_from_memcg_idx We can't see the relationship with memcg from the parameters, so the name with memcg_idx would be more reasonable. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
721ae22a |
|
08-Nov-2013 |
Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> |
mm, slub: fix the typo in mm/slub.c Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c6f58d9b |
|
07-Nov-2013 |
Christoph Lameter <cl@linux.com> |
slub: Handle NULL parameter in kmem_cache_flags Andreas Herrmann writes: When I've used slub_debug kernel option (e.g. "slub_debug=,skbuff_fclone_cache" or similar) on a debug session I've seen a panic like: Highbank #setenv bootargs console=ttyAMA0 root=/dev/sda2 kgdboc.kgdboc=ttyAMA0,115200 slub_debug=,kmalloc-4096 earlyprintk=ttyAMA0 ... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c0004000 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] SMP ARM Modules linked in: CPU: 0 PID: 0 Comm: swapper Tainted: G W 3.12.0-00048-gbe408cd #314 task: c0898360 ti: c088a000 task.ti: c088a000 PC is at strncmp+0x1c/0x84 LR is at kmem_cache_flags.isra.46.part.47+0x44/0x60 pc : [<c02c6da0>] lr : [<c0110a3c>] psr: 200001d3 sp : c088bea8 ip : c088beb8 fp : c088beb4 r10: 00000000 r9 : 413fc090 r8 : 00000001 r7 : 00000000 r6 : c2984a08 r5 : c0966e78 r4 : 00000000 r3 : 0000006b r2 : 0000000c r1 : 00000000 r0 : c2984a08 Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 0000404a DAC: 00000015 Process swapper (pid: 0, stack limit = 0xc088a248) Stack: (0xc088bea8 to 0xc088c000) bea0: c088bed4 c088beb8 c0110a3c c02c6d90 c0966e78 00000040 bec0: ef001f00 00000040 c088bf14 c088bed8 c0112070 c0110a04 00000005 c010fac8 bee0: c088bf5c c088bef0 c010fac8 ef001f00 00000040 00000000 00000040 00000001 bf00: 413fc090 00000000 c088bf34 c088bf18 c0839190 c0112040 00000000 ef001f00 bf20: 00000000 00000000 c088bf54 c088bf38 c0839200 c083914c 00000006 c0961c4c bf40: c0961c28 00000000 c088bf7c c088bf58 c08392ac c08391c0 c08a2ed8 c0966e78 bf60: c086b874 c08a3f50 c0961c28 00000001 c088bfb4 c088bf80 c083b258 c0839248 bf80: 2f800000 0f000000 c08935b4 ffffffff c08cd400 ffffffff c08cd400 c0868408 bfa0: c29849c0 00000000 c088bff4 c088bfb8 c0824974 c083b1e4 ffffffff ffffffff bfc0: c08245c0 00000000 00000000 c0868408 00000000 10c5387d c0892bcc c0868404 bfe0: c0899440 0000406a 00000000 c088bff8 00008074 c0824824 00000000 00000000 
[<c02c6da0>] (strncmp+0x1c/0x84) from [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) from [<c0112070>] (__kmem_cache_create+0x3c/0x410) [<c0112070>] (__kmem_cache_create+0x3c/0x410) from [<c0839190>] (create_boot_cache+0x50/0x74) [<c0839190>] (create_boot_cache+0x50/0x74) from [<c0839200>] (create_kmalloc_cache+0x4c/0x88) [<c0839200>] (create_kmalloc_cache+0x4c/0x88) from [<c08392ac>] (create_kmalloc_caches+0x70/0x114) [<c08392ac>] (create_kmalloc_caches+0x70/0x114) from [<c083b258>] (kmem_cache_init+0x80/0xe0) [<c083b258>] (kmem_cache_init+0x80/0xe0) from [<c0824974>] (start_kernel+0x15c/0x318) [<c0824974>] (start_kernel+0x15c/0x318) from [<00008074>] (0x8074) Code: e3520000 01a00002 089da800 e5d03000 (e5d1c000) ---[ end trace 1b75b31a2719ed1d ]--- Kernel panic - not syncing: Fatal exception Problem is that slub_debug option is not parsed before create_boot_cache is called. Solve this by changing slub_debug to early_param. Kernels 3.11, 3.10 are also affected. I am not sure about older kernels. Christoph Lameter explains: kmem_cache_flags may be called with NULL parameter during early boot. Skip the test in that case. Cc: stable@vger.kernel.org # 3.10 and 3.11 Reported-by: Andreas Herrmann <andreas.herrmann@calxeda.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
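The guard can be sketched as a small user-space function. The names `cache_flags()`, `debug_flags`, `debug_slabs` and `DEBUG_FLAG` are assumptions chosen for illustration and only loosely mirror the kernel's variables; the relevant part is that `name` is tested for NULL before it is handed to strncmp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEBUG_FLAG 0x1UL

/* Hypothetical parsed boot option: debug only caches with this prefix. */
static unsigned long debug_flags = DEBUG_FLAG;
static const char *debug_slabs = "kmalloc-4096";

/*
 * Returns the flags to use for a cache. 'name' may be NULL during early
 * boot, so it is checked before the prefix comparison.
 */
static unsigned long cache_flags(unsigned long flags, const char *name)
{
	if (debug_flags && (!debug_slabs ||
	    (name && !strncmp(debug_slabs, name, strlen(debug_slabs)))))
		flags |= debug_flags;

	return flags;
}
```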
|
#
d56791b3 |
|
08-Oct-2013 |
Roman Bobniev <Roman.Bobniev@sonymobile.com> |
slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled Move all kmemleak calls into hook functions, and make it so that all hooks (both inside and outside of #ifdef CONFIG_SLUB_DEBUG) call the appropriate kmemleak routines. This allows for kmemleak to be configured independently of slub debug features. It also fixes a bug where kmemleak was only partially enabled in some configurations. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Roman Bobniev <Roman.Bobniev@sonymobile.com> Signed-off-by: Tim Bird <tim.bird@sonymobile.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
|
#
d1756174 |
|
17-Oct-2013 |
Xie XiuQi <xiexiuqi@huawei.com> |
mm: Fix some trivial typos in comments Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
3dbb95f7 |
|
11-Sep-2013 |
Jingoo Han <jg1.han@samsung.com> |
mm: replace strict_strtoul() with kstrtoul() The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
76b6f3d2 |
|
04-Sep-2013 |
Christoph Lameter <cl@linux.com> |
slub: remove verify_mem_not_deleted() I do not see any user for this code in the tree. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f1b6eb6e |
|
04-Sep-2013 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move kmallocXXX functions to common code The kmalloc* functions of all slab allocators are similar now, so let's move them into slab.h. This requires some function naming changes in slob. As a result of this patch there is a common set of functions for all allocators. It also means that kmalloc_large() is now available in general to perform large order allocations that go directly via the page allocator. kmalloc_large() can be substituted if kmalloc() throws warnings because of too large allocations. kmalloc_large() has exactly the same semantics as kmalloc but can only be used for allocations > PAGE_SIZE. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
68f06650 |
|
14-Jul-2013 |
Chen Gang <gang.chen@asianux.com> |
mm/slub.c: beautify code for removing redundancy 'break' statement. Remove the redundant 'break' statement. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ac6434e6 |
|
18-Jul-2013 |
Libin <huawei.libin@huawei.com> |
slub: Remove unnecessary page NULL check Commit 4d7868e6 ("slub: Do not dereference NULL pointer in node_match") added a NULL check for page in node_match(). It is therefore unnecessary to check it before calling node_match(), so remove the check. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
37090506 |
|
08-Aug-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "slub: do not put a slab to cpu partial list when cpu_partial is 0" This reverts commit 318df36e57c0ca9f2146660d41ff28e8650af423. This commit caused Steven Rostedt's hackbench runs to run out of memory due to a leak. As noted by Joonsoo Kim, it is buggy in the following scenario: "I guess, you may set 0 to all kmem caches's cpu_partial via sysfs, doesn't it? In this case, memory leak is possible in following case. Code flow of possible leak is follwing case. * in __slab_free() 1. (!new.inuse || !prior) && !was_frozen 2. !kmem_cache_debug && !prior 3. new.frozen = 1 4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1 5. with this patch, put_cpu_partial() doesn't do anything, because this cache's cpu_partial is 0 6. return In step 5, leak occur" And Steven does indeed have cpu_partial set to 0 due to RT testing. Joonsoo is cooking up a patch, but everybody agrees that reverting this for now is the right thing to do. Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0e0ac97 |
|
14-Jul-2013 |
Chen Gang <gang.chen@asianux.com> |
mm/slub: beautify code for 80 column limitation and tab alignment Be sure of 80 column limitation for both code and comments. Correct tab alignment for 'if-else' statement. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
e35e1a97 |
|
11-Jul-2013 |
Chen Gang <gang.chen@asianux.com> |
mm/slub: remove 'per_cpu' which is useless variable Remove 'per_cpu', since it is useless now after the patch: "205ab99 slub: Update statistics handling for variable order slabs". And the partial list is handled in the same way as the per cpu slab. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
0db0628d |
|
19-Jun-2013 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
kernel: delete __cpuinit usage from all core kernel files The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
c25f195e |
|
17-Jan-2013 |
Steven Rostedt <rostedt@goodmis.org> |
slub: Check for page NULL before doing the node_match check In the -rt kernel (mrg), we hit the following dump: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180 PGD a2d39067 PUD b1641067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod CPU 3 Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992 RIP: 0010:[<ffffffff811573f1>] [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180 RSP: 0018:ffff8800a9b17d70 EFLAGS: 00010213 RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000 RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500 RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500 R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000 FS: 00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000) Stack: ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011 0000000001200011 0000000001200011 0000000000000000 0000000000000000 00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd Call Trace: [<ffffffff81202e08>] ? current_has_perm+0x68/0x80 [<ffffffff81041cbd>] copy_process+0xdd/0x15b0 [<ffffffff810a2125>] ? 
rt_up_read+0x25/0x30 [<ffffffff8104369a>] do_fork+0x5a/0x360 [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220 [<ffffffff8100b068>] sys_clone+0x28/0x30 [<ffffffff81527423>] stub_clone+0x13/0x20 [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2 RIP [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180 RSP <ffff8800a9b17d70> CR2: 0000000000000000 ---[ end trace 0000000000000002 ]--- Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do disable migration. But the SLUB code is relatively lockless, and the spin_locks there are raw_spin_locks (not converted to mutexes), thus I believe this bug can happen in mainline without -rt features. The -rt patch is just good at triggering mainline bugs ;-) Anyway, looking at where this crashed, it seems that the page variable can be NULL when passed to the node_match() function (which does not check if it is NULL). When this happens we get the above panic. As page is only used in slab_alloc() to check if the node matches, if it's NULL I'm assuming that we can say it doesn't and call the __slab_alloc() code. Is this a correct assumption? Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
345c905d |
|
18-Jun-2013 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: Make cpu partial slab support configurable CPU partial support can introduce level of indeterminism that is not wanted in certain context (like a realtime kernel). Make it configurable. This patch is based on Christoph Lameter's "slub: Make cpu partial slab support configurable V2". Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
318df36e |
|
19-Jun-2013 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: do not put a slab to cpu partial list when cpu_partial is 0 In the free path, we don't check the value of cpu_partial, so a slab can be linked into the cpu partial list even if cpu_partial is 0. To prevent this, we should check cpu_partial in put_cpu_partial(). Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c17fd13e |
|
03-Jul-2013 |
Wanpeng Li <liwanp@linux.vnet.ibm.com> |
mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo Use existing interface node_nr_slabs and node_nr_objs to get nr_slabs and nr_objs. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a4463364 |
|
03-Jul-2013 |
Wanpeng Li <liwanp@linux.vnet.ibm.com> |
mm/slub: Drop unnecessary nr_partials This patch removes the unused nr_partials variable. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
3ac38faa |
|
29-Apr-2013 |
Andrew Morton <akpm@linux-foundation.org> |
mm/slub.c: use register_hotmemory_notifier() Squishes a statement-with-no-effect warning, removes some ifdefs and shrinks .text by 2 bytes. Note that this code fails to check for blocking_notifier_chain_register() failures. Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7cccd80b |
|
23-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slub: tid must be retrieved from the percpu area of the current processor As Steven Rostedt has pointed out: rescheduling could occur on a different processor after the determination of the per cpu pointer and before the tid is retrieved. This could result in allocation from the wrong node in slab_alloc(). The effect is much more severe in slab_free() where we could free to the freelist of the wrong page. The window for something like that occurring is pretty small but it is possible. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4d7868e6 |
|
23-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slub: Do not dereference NULL pointer in node_match The variables accessed in slab_alloc are volatile and therefore the page pointer passed to node_match can be NULL. The processing of data in slab_alloc is tentative until either the cmpxchg succeeds or the __slab_alloc slowpath is invoked. Both are able to perform the same allocation from the freelist. Check for the NULL pointer in node_match. A false positive will lead to a retry of the loop in __slab_alloc. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
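The NULL check described in this message can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct mock_page` and its `nid` field are hypothetical stand-ins for `struct page` and its node lookup, and the `NUMA_NO_NODE` value is redefined locally for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the kernel constant meaning "no node preference". */
#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for struct page; only the node id is modelled. */
struct mock_page {
	int nid;
};

/*
 * Returns true when the fast path may serve a request for `node` from
 * `page`. A NULL page never matches, so the caller takes the slow path
 * instead of dereferencing the pointer.
 */
static bool node_match(const struct mock_page *page, int node)
{
	if (!page)
		return false;
	if (node != NUMA_NO_NODE && page->nid != node)
		return false;
	return true;
}
```

Returning "no match" for a NULL page is the conservative choice: the worst case is an unnecessary trip through the slow path, which can perform the same allocation.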
|
#
338b2642 |
|
21-Jan-2013 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: add 'likely' macro to inc_slabs_node() After the boot phase, 'n' always exists. So add the 'likely' macro to help the compiler. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
633b0764 |
|
21-Jan-2013 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
slub: correct to calculate num of acquired objects in get_partial_node() There is a subtle bug when calculating the number of acquired objects. Currently, we calculate "available = page->objects - page->inuse", after acquire_slab() is called in get_partial_node(). In acquire_slab() with mode = 1, we always set new.inuse = page->objects. So, acquire_slab(s, n, page, object == NULL); if (!object) { c->page = page; stat(s, ALLOC_FROM_PARTIAL); object = t; available = page->objects - page->inuse; !!! available is always 0 !!! ... Therefore, "available > s->cpu_partial / 2" is always false and we always go to the second iteration. This patch corrects this problem. After that, we don't need the return value of put_cpu_partial(), so remove it. Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
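The ordering problem in this message can be reproduced with a small userspace sketch. The names `mock_slab`, `acquire_all`, `available_buggy` and `available_fixed` are hypothetical and only model the arithmetic; the real fix lives in acquire_slab() and get_partial_node() and may be structured differently.

```c
/* Hypothetical model of a slab: total objects and objects in use. */
struct mock_slab {
	unsigned int objects;
	unsigned int inuse;
};

/*
 * Models acquiring a slab with mode = 1: the whole freelist is taken,
 * so inuse becomes objects. Returns how many objects were free before.
 */
static unsigned int acquire_all(struct mock_slab *slab)
{
	unsigned int free_before = slab->objects - slab->inuse;

	slab->inuse = slab->objects;
	return free_before;
}

/* Buggy order: the count is read after inuse was overwritten. */
static unsigned int available_buggy(struct mock_slab *slab)
{
	acquire_all(slab);
	return slab->objects - slab->inuse; /* always 0 */
}

/* Corrected order: use the count captured before the overwrite. */
static unsigned int available_fixed(struct mock_slab *slab)
{
	return acquire_all(slab);
}
```

With 16 objects of which 5 are in use, the buggy variant reports 0 available objects while the corrected one reports 11, which is why the "available > s->cpu_partial / 2" test could never succeed before the fix.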
|
#
7d557b3c |
|
22-Feb-2013 |
Glauber Costa <glommer@parallels.com> |
slub: correctly bootstrap boot caches After we create a boot cache, we may allocate from it until it is bootstrapped. This will move the page from the partial list to the cpu slab list. If this happens, the loop: list_for_each_entry(p, &n->partial, lru) that we use to scan for all partial pages will yield nothing, and the pages will keep pointing to the boot cpu cache, which is, of course, invalid. To avoid that, we should flush the cache to make sure that the cpu slab is back on the partial list. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Steffen Michalke <StMichalke@web.de> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
22b751c3 |
|
22-Feb-2013 |
Mel Gorman <mgorman@suse.de> |
mm: rename page struct field helpers The function names page_xchg_last_nid(), page_last_nid() and reset_page_last_nid() were judged to be inconsistent so rename them to a struct_field_op style pattern. As it looked jarring to have reset_page_mapcount() and page_nid_reset_last() beside each other in memmap_init_zone(), this patch also renames reset_page_mapcount() to page_mapcount_reset(). There are others like init_page_count() but as it is used throughout the arch code a rename would likely cause more conflicts than it is worth. [akpm@linux-foundation.org: fix zcache] Signed-off-by: Mel Gorman <mgorman@suse.de> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2c59dd65 |
|
10-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slab: Common Kmalloc cache determination Extract the optimized lookup functions from slub and put them into slab_common.c. Then make slab use these functions as well. Joonsoo notes that this fixes some issues with constant folding which also reduces the code size for slub. https://lkml.org/lkml/2012/10/20/82 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f97d5f63 |
|
10-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slab: Common function to create the kmalloc array The kmalloc array is created in similar ways in both SLAB and SLUB. Create a common function and have both allocators call that function. V1->V2: Whitespace cleanup Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
9425c58e |
|
10-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slab: Common definition for the array of kmalloc caches Have a common definition for the kmalloc cache arrays in SLAB and SLUB. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
95a05b42 |
|
10-Jan-2013 |
Christoph Lameter <cl@linux.com> |
slab: Common constants for kmalloc boundaries Standardize the constants that describe the smallest and largest object kept in the kmalloc arrays for SLAB and SLUB. Differentiate between the maximum size for which a slab cache is used (KMALLOC_MAX_CACHE_SIZE) and the maximum allocatable size (KMALLOC_MAX_SIZE, KMALLOC_MAX_ORDER). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
373d4d09 |
|
20-Jan-2013 |
Rusty Russell <rusty@rustcorp.com.au> |
taint: add explicit flag to show whether lock dep is still OK. Fix up all callers as they were before, with one change: an unsigned module taints the kernel, but doesn't turn off lockdep. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
#
5413dfba |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
slub: drop mutex before deleting sysfs entry Sasha Levin recently reported a lockdep problem resulting from the new attribute propagation introduced by kmemcg series. In short, slab_mutex will be called from within the sysfs attribute store function. This will create a dependency, that will later be held backwards when a cache is destroyed - since destruction occurs with the slab_mutex held, and then calls in to the sysfs directory removal function. In this patch, I propose to adopt a strategy close to what __kmem_cache_create does before calling sysfs_slab_add, and release the lock before the call to sysfs_slab_remove. This is pretty much the last operation in the kmem_cache_shutdown() path, so we could do better by splitting this and moving this call alone to later on. This will fit nicely when sysfs handling is consistent between all caches, but will look weird now. Lockdep info: ====================================================== [ INFO: possible circular locking dependency detected ] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W ------------------------------------------------------- trinity-child13/6961 is trying to acquire lock: (s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60 but task is already holding lock: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0x1aa/0x240 __mutex_lock_common+0x59/0x5a0 mutex_lock_nested+0x3f/0x50 slab_attr_store+0xde/0x110 sysfs_write_file+0xfa/0x150 vfs_write+0xb0/0x180 sys_pwrite64+0x60/0xb0 tracesys+0xe1/0xe6 -> #0 (s_active#43){++++.+}: __lock_acquire+0x14df/0x1ca0 lock_acquire+0x1aa/0x240 sysfs_deactivate+0x122/0x1a0 sysfs_addrm_finish+0x31/0x60 sysfs_remove_dir+0x89/0xd0 kobject_del+0x16/0x40 __kmem_cache_shutdown+0x40/0x60 kmem_cache_destroy+0x40/0xe0 mon_text_release+0x78/0xe0 __fput+0x122/0x2d0 ____fput+0x9/0x10 task_work_run+0xbe/0x100 do_exit+0x432/0xbd0 do_group_exit+0x84/0xd0 get_signal_to_deliver+0x81d/0x930 do_signal+0x3a/0x950 do_notify_resume+0x3e/0x90 int_signal+0x12/0x17 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#43); lock(slab_mutex); lock(s_active#43); *** DEADLOCK *** 2 locks held by trinity-child13/6961: #0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0 #1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0 stack backtrace: Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Call Trace: print_circular_bug+0x1fb/0x20c __lock_acquire+0x14df/0x1ca0 lock_acquire+0x1aa/0x240 sysfs_deactivate+0x122/0x1a0 sysfs_addrm_finish+0x31/0x60 sysfs_remove_dir+0x89/0xd0 kobject_del+0x16/0x40 __kmem_cache_shutdown+0x40/0x60 kmem_cache_destroy+0x40/0xe0 mon_text_release+0x78/0xe0 __fput+0x122/0x2d0 ____fput+0x9/0x10 task_work_run+0xbe/0x100 do_exit+0x432/0xbd0 do_group_exit+0x84/0xd0 get_signal_to_deliver+0x81d/0x930 do_signal+0x3a/0x950 do_notify_resume+0x3e/0x90 int_signal+0x12/0x17 Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter 
<cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ebe945c2 |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
memcg: add comments clarifying aspects of cache attribute propagation This patch clarifies two aspects of cache attribute propagation. First, the expected context for the for_each_memcg_cache macro in memcontrol.h. The usages already in the codebase are safe. In mm/slub.c, it is trivially safe because the lock is acquired right before the loop. In mm/slab.c, it is less so: the lock is acquired by an outer function a few steps back in the stack, so a VM_BUG_ON() is added to make sure it is indeed safe. A comment is also added to detail why we are returning the value of the parent cache and ignoring the children's when we propagate the attributes. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
107dab5c |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
slub: slub-specific propagation changes SLUB allows us to tune a particular cache behavior with sysfs-based tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This can be done by tapping into the store attribute function provided by the allocator. We of course don't need to mess with read-only fields. Since the attributes can have multiple types and are stored internally by sysfs, the best strategy is to issue a ->show() in the root cache, and then ->store() in the memcg cache. The drawback of that is that sysfs can allocate up to a page in buffering for show(), which we are likely not to need, but also can't guarantee. To avoid always allocating a page for that, we can update the caches at store time with the maximum attribute size ever stored to the root cache. We will then get a buffer big enough to hold it. The corollary to this is that if no stores happened, nothing will be propagated. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. [akpm@linux-foundation.org: tweak code to avoid __maybe_unused] Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f458cbf |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
memcg: destroy memcg caches Implement destruction of memcg caches. Right now, only caches where our reference counter is the last remaining are deleted. If there are any other reference counters around, we just leave the caches lying around until they go away. When that happens, a destruction function is called from the cache code. Caches are only destroyed in process context, so we queue them up for later processing in the general case. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d79923fa |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
sl[au]b: allocate objects from memcg cache We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event, this will give us a good picture about who is the first group to touch a cache page. This patch uses the now available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9ce5ef4 |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
sl[au]b: always get the cache from its page in kmem_cache_free() struct page already has this information. If we start chaining caches, this information will always be more trustworthy than whatever is passed into the function. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2633d7a0 |
|
18-Dec-2012 |
Glauber Costa <glommer@parallels.com> |
slab/slub: consider a memcg parameter in kmem_create_cache Allow a memcg parameter to be passed during cache creation. When the slub allocator is being used, it will only merge caches that belong to the same memcg. We'll do this by scanning the global list, and then translating the cache to a memcg-specific cache Default function is created as a wrapper, passing NULL to the memcg version. We only merge caches that belong to the same memcg. A helper is provided, memcg_css_id: because slub needs a unique cache name for sysfs. Since this is visible, but not the canonical location for slab data, the cache name is not used, the css_id should suffice. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9d5ab25 |
|
11-Dec-2012 |
Lai Jiangshan <laijs@cn.fujitsu.com> |
slub, hotplug: ignore unrelated node's hot-adding and hot-removing SLUB only focuses on the nodes which have normal memory and it ignores the other nodes' hot-adding and hot-removing. That is: if some memory of a node which has no onlined memory is onlined, but this newly onlined memory is not normal memory (for example, highmem), we should not allocate kmem_cache_node for SLUB. And if the last normal memory is offlined, but the node still has memory, we should remove kmem_cache_node for that node. (The current code delays it until all of the memory is offlined.) So we only do something when marg->status_change_nid_normal > 0. marg->status_change_nid is not suitable here. The same problem doesn't exist in SLAB, because SLAB allocates kmem_list3 for every node even if the node doesn't have normal memory; SLAB tolerates kmem_list3 on alien nodes. SLUB only focuses on the nodes which have normal memory; it doesn't tolerate alien kmem_cache_node. The patch makes SLUB become self-compatible and avoids WARNs and BUGs in rare conditions. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Rob Landley <rob@landley.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
45906855 |
|
28-Nov-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Common alignment code Extract the code to do object alignment from the allocators. Do the alignment calculations in slab_common so that the __kmem_cache_create functions of the allocators do not have to deal with alignment. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
dffb4d60 |
|
28-Nov-2012 |
Christoph Lameter <cl@linux.com> |
slub: Use statically allocated kmem_cache boot structure for bootstrap Simplify bootstrap by statically allocating two kmem_cache structures. These are freed after bootup is complete. This allows us to no longer worry about calculations of sizes of kmem_cache structures during bootstrap. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
45530c44 |
|
28-Nov-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[au]b: create common functions for boot slab creation Use a special function to create kmalloc caches and use that function in SLAB and SLUB. Acked-by: Joonsoo Kim <js1304@gmail.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
59a09917 |
|
28-Nov-2012 |
Christoph Lameter <cl@linux.com> |
slub: Use correct cpu_slab on dead cpu Pass a kmem_cache_cpu pointer into unfreeze partials so that a different kmem_cache_cpu structure than the local one can be specified. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d8843922 |
|
17-Oct-2012 |
Glauber Costa <glommer@parallels.com> |
slab: Ignore internal flags in cache creation Some flags are used internally by the allocators for management purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses to mark that the metadata for that cache is stored outside of the slab. No cache should ever pass those as creation flags. We can just ignore this bit if it happens to be passed (such as when duplicating a cache in the kmem memcg patches). Because such flags can vary from allocator to allocator, we allow them to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with all flags that are valid at creation time. Allocators that don't have any specific flag requirement should define that to mean all flags. Common code will mask out all flags not belonging to that set. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
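The masking step described above amounts to a single bitwise AND. The sketch below is a userspace illustration only: the `DEMO_*` flag names and their values are invented for the example, and `sanitize_creation_flags` is a hypothetical helper, not a kernel function.

```c
/* Hypothetical flag values, chosen for illustration only. */
#define DEMO_HWCACHE_ALIGN     0x1UL
#define DEMO_POISON            0x2UL
#define DEMO_INTERNAL_OFF_SLAB 0x80000000UL

/* Plays the role of SLAB_AVAILABLE_FLAGS: flags valid at creation time. */
#define DEMO_AVAILABLE_FLAGS   (DEMO_HWCACHE_ALIGN | DEMO_POISON)

/*
 * Drops every bit outside the allocator's set of valid creation flags,
 * so an internal flag passed by mistake is silently ignored.
 */
static unsigned long sanitize_creation_flags(unsigned long flags)
{
	return flags & DEMO_AVAILABLE_FLAGS;
}
```

An allocator with no internal flags would define its available set as all bits, which turns the mask into a no-op.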
|
#
242860a4 |
|
19-Oct-2012 |
Ezequiel Garcia <elezegarcia@gmail.com> |
mm/sl[aou]b: Move common kmem_cache_size() to slab.h This function is identically defined in all three allocators and it's trivial to move it to slab.h Since now it's static, inline, header-defined function this patch also drops the EXPORT_SYMBOL tag. Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
1b4f59e3 |
|
22-Oct-2012 |
Glauber Costa <glommer@parallels.com> |
slub: Commonize slab_cache field in struct page Right now, slab and slub have fields in struct page to derive which cache a page belongs to, but they do it slightly differently. slab uses a field called slab_cache, that lives in the third double word. slub, uses a field called "slab", living outside of the doublewords area. Ideally, we could use the same field for this. Since slub heavily makes use of the doubleword region, there isn't really much room to move slub's slab_cache field around. Since slab does not have such strict placement restrictions, we can move it outside the doubleword area. The naming used by slab, "slab_cache", is less confusing, and it is preferred over slub's generic "slab". Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
0d7561c6 |
|
19-Oct-2012 |
Glauber Costa <glommer@parallels.com> |
sl[au]b: Process slabinfo_show in common code With all the infrastructure in place, we can now have slabinfo_show done from slab_common.c. A cache-specific function is called to grab information about the cache itself, since that is still heavily dependent on the implementation. But with the values produced by it, all the printing and handling is done from common code. Signed-off-by: Glauber Costa <glommer@parallels.com> CC: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
bcee6e2a |
|
19-Oct-2012 |
Glauber Costa <glommer@parallels.com> |
mm/sl[au]b: Move print_slabinfo_header to slab_common.c The header format is highly similar between slab and slub. The main difference lies in the fact that slab may optionally have statistics added here in case of CONFIG_DEBUG_SLAB, while the slub will stick them somewhere else. By making sure that information conditionally lives inside a globally-visible CONFIG_DEBUG_SLAB switch, we can move the header printing to a common location. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
b7454ad3 |
|
19-Oct-2012 |
Glauber Costa <glommer@parallels.com> |
mm/sl[au]b: Move slabinfo processing to slab_common.c This patch moves all the common machinery for slabinfo processing to slab_common.c. We can do better by noticing that the output is heavily common, and having the allocators just provide finished information about this. But after this first step, this can be done more easily. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
837d678d |
|
15-Aug-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: remove one code path and reduce lock contention in __slab_free() When we try to free an object, there are some cases in which we need to take the node lock. This is a necessary step for preventing a race. After taking the lock, we try cmpxchg_double_slab(). But there is a possible scenario in which cmpxchg_double_slab() fails even with the lock held. The following example explains it.
CPU A                  CPU B
need lock
...                    need lock
...                    lock!!
lock..but spin         free success
spin...                unlock
lock!!
free fail
In this case, CPU A retries while still holding the lock. I think that in this case, for CPU A, "release a lock first, and re-take a lock if necessary" is the preferable way. There are two reasons for this. First, this makes __slab_free()'s logic somewhat simpler. With this patch, 'was_frozen = 1' is "always" handled without taking a lock. So we can remove one code path. Second, it may reduce lock contention. When we retry, the status of the slab has already changed, so we don't need a lock anymore in almost every case. The "release a lock first, and re-take a lock if necessary" policy helps with this. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
788e1aad |
|
28-Sep-2012 |
Fengguang Wu <fengguang.wu@intel.com> |
slub: init_kmem_cache_cpus() and put_cpu_partial() can be static Acked-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
2b847c3c |
|
08-Sep-2012 |
Ezequiel Garcia <elezegarcia@gmail.com> |
mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB This patch does not fix anything, and its only goal is to enable us to obtain some common code between SLAB and SLUB. Neither behavior nor produced code is affected. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
645df230 |
|
18-Sep-2012 |
Dave Jones <davej@redhat.com> |
mm, sl[au]b: Taint kernel when we detect a corrupted slab It doesn't seem worth adding a new taint flag for this, so just re-use the one from 'bad page' Acked-by: Christoph Lameter <cl@linux.com> # SLUB Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8ba00bb6 |
|
17-Sep-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: consider pfmemalloc_match() in get_partial_node() get_partial() is currently not checking pfmemalloc_match() meaning that it is possible for pfmemalloc pages to leak to non-pfmemalloc users. This is a problem in the following situation. Assume that there is a request from normal allocation and there are no objects in the per-cpu cache and no node-partial slab. In this case, slab_alloc enters the slow path and new_slab_objects() is called which may return a PFMEMALLOC page. As the current user is not allowed to access PFMEMALLOC page, deactivate_slab() is called ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks]) and returns an object from PFMEMALLOC page. Next time, when we get another request from normal allocation, slab_alloc() enters the slow-path and calls new_slab_objects(). In new_slab_objects(), we call get_partial() and get a partial slab which was just deactivated but is a pfmemalloc page. We extract one object from it and re-deactivate. "deactivate -> re-get in get_partial -> re-deactivate" occurs repeatedly. As a result, access to PFMEMALLOC page is not properly restricted and it can cause a performance degradation due to frequent deactivation. This patch changes get_partial_node() to take pfmemalloc_match() into account and prevents the "deactivate -> re-get in get_partial()" scenario. Instead, new_slab() is called. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
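The filtering idea in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_page`, `model_pfmemalloc_match()` and `model_get_partial_node()` are hypothetical names invented for illustration, and the boolean `caller_may_use_reserves` stands in for the kernel's GFP and process flag checks. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a slab page sitting on a node's partial list. */
struct model_page {
	bool pfmemalloc;           /* page was taken from the emergency reserves */
	int free_objects;          /* objects still available in this slab */
	struct model_page *next;   /* next page on the partial list */
};

/* Reserve pages may only be handed to callers entitled to the reserves. */
static bool model_pfmemalloc_match(const struct model_page *page,
				   bool caller_may_use_reserves)
{
	assert(page != NULL);
	if (page->pfmemalloc)
		return caller_may_use_reserves;
	return true;
}

/*
 * Walk the partial list and return the first usable page.
 * Pages that fail the pfmemalloc check are skipped, so an ordinary
 * caller never picks up a reserve page again and again.
 * NULL means: fall back to allocating a fresh slab.
 */
static struct model_page *model_get_partial_node(struct model_page *partial_list,
						 bool caller_may_use_reserves)
{
	for (struct model_page *p = partial_list; p != NULL; p = p->next) {
		if (!model_pfmemalloc_match(p, caller_may_use_reserves))
			continue;
		if (p->free_objects > 0)
			return p;
	}
	return NULL;
}
```

In this model an ordinary caller skips the reserve page at the head of the list and gets the next normal page, or NULL when none is left, which corresponds to the "Instead, new_slab() is called" outcome described in the message.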
|
#
9df53b15 |
|
08-Sep-2012 |
Christoph Lameter <cl@linux.com> |
slub: Zero initial memory segment for kmem_cache and kmem_cache_node Tony Luck reported the following problem on IA-64: Worked fine yesterday on next-20120905, crashes today. First sign of trouble was an unaligned access, then a NULL dereference. SL*B related bits of my config: CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y CONFIG_SLABINFO=y # CONFIG_SLUB_DEBUG_ON is not set # CONFIG_SLUB_STATS is not set And he console log. PID hash table entries: 4096 (order: 1, 32768 bytes) Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes) Memory: 2047920k/2086064k available (13992k code, 38144k reserved, 6012k data, 880k init) kernel unaligned access to 0xca2ffc55fb373e95, ip=0xa0000001001be550 swapper[0]: error during unaligned kernel access -1 [1] Modules linked in: Pid: 0, CPU 0, comm: swapper psr : 00001010084a2018 ifs : 800000000000060f ip : [<a0000001001be550>] Not tainted (3.6.0-rc4-zx1-smp-next-20120906) ip is at new_slab+0x90/0x680 unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003 rnat: 9666960159966a59 bsps: a0000001001441c0 pr : 9666960159965a59 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000001001be500 b6 : a00000010112cb20 b7 : a0000001011660a0 f6 : 0fff7f0f0f0f0e54f0000 f7 : 0ffe8c5c1000000000000 f8 : 1000d8000000000000000 f9 : 100068800000000000000 f10 : 10005f0f0f0f0e54f0000 f11 : 1003e0000000000000078 r1 : a00000010155eef0 r2 : 0000000000000000 r3 : fffffffffffc1638 r8 : e0000040600081b8 r9 : ca2ffc55fb373e95 r10 : 0000000000000000 r11 : e000004040001646 r12 : a000000101287e20 r13 : a000000101280000 r14 : 0000000000004000 r15 : 0000000000000078 r16 : ca2ffc55fb373e75 r17 : e000004040040000 r18 : fffffffffffc1646 r19 : e000004040001646 r20 : fffffffffffc15f8 r21 : 000000000000004d r22 : a00000010132fa68 r23 : 00000000000000ed r24 : 0000000000000000 r25 : 0000000000000000 r26 
: 0000000000000001 r27 : a0000001012b8500 r28 : a00000010135f4a0 r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000001 Unable to handle kernel NULL pointer dereference (address 0000000000000018) swapper[0]: Oops 11003706212352 [2] Modules linked in: Pid: 0, CPU 0, comm: swapper psr : 0000121008022018 ifs : 800000000000cc18 ip : [<a0000001004dc8f1>] Not tainted (3.6.0-rc4-zx1-smp-next-20120906) ip is at __copy_user+0x891/0x960 unat: 0000000000000000 pfs : 0000000000000813 rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 9666960159961765 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a00000010004b550 b6 : a00000010004b740 b7 : a00000010000c750 f6 : 000000000000000000000 f7 : 1003e9e3779b97f4a7c16 f8 : 1003e0a00000010001550 f9 : 100068800000000000000 f10 : 10005f0f0f0f0e54f0000 f11 : 1003e0000000000000078 r1 : a00000010155eef0 r2 : a0000001012870b0 r3 : a0000001012870b8 r8 : 0000000000000298 r9 : 0000000000000013 r10 : 0000000000000000 r11 : 9666960159961a65 r12 : a000000101287010 r13 : a000000101280000 r14 : a000000101287068 r15 : a000000101287080 r16 : 0000000000000298 r17 : 0000000000000010 r18 : 0000000000000018 r19 : a000000101287310 r20 : 0000000000000290 r21 : 0000000000000000 r22 : 0000000000000000 r23 : a000000101386f58 r24 : 0000000000000000 r25 : 000000007fffffff r26 : a000000101287078 r27 : a0000001013c69b0 r28 : 0000000000000000 r29 : 0000000000000014 r30 : 0000000000000000 r31 : 0000000000000813 Sedat Dilek and Hugh Dickins reported similar problems as well. Earlier patches in the common set moved the zeroing of the kmem_cache structure into common code. See "Move allocation of kmem_cache into common code". The allocation for the two special structures is still done from SLUB specific code but no zeroing is done since the cache creation functions used to zero. 
This now needs to be updated so that the structures are zeroed during allocation in kmem_cache_init(). Otherwise random pointer values may be followed. Reported-by: Tony Luck <tony.luck@intel.com> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
aac3a166 |
|
04-Sep-2012 |
Pekka Enberg <penberg@kernel.org> |
Revert "mm/sl[aou]b: Move sysfs_slab_add to common" This reverts commit 96d17b7be0a9849d381442030886211dbb2a7061 which caused the following errors at boot: [ 1.114885] kobject (ffff88001a802578): tried to init an initialized object, something is seriously wrong. [ 1.114885] Pid: 1, comm: swapper/0 Tainted: G W 3.6.0-rc1+ #6 [ 1.114885] Call Trace: [ 1.114885] [<ffffffff81273f37>] kobject_init+0x87/0xa0 [ 1.115555] [<ffffffff8127426a>] kobject_init_and_add+0x2a/0x90 [ 1.115555] [<ffffffff8127c870>] ? sprintf+0x40/0x50 [ 1.115555] [<ffffffff81124c60>] sysfs_slab_add+0x80/0x210 [ 1.115555] [<ffffffff81100175>] kmem_cache_create+0xa5/0x250 [ 1.115555] [<ffffffff81cf24cd>] ? md_init+0x144/0x144 [ 1.115555] [<ffffffff81cf25b6>] local_init+0xa4/0x11b [ 1.115555] [<ffffffff81cf24e1>] dm_init+0x14/0x45 [ 1.115836] [<ffffffff810001ba>] do_one_initcall+0x3a/0x160 [ 1.116834] [<ffffffff81cc2c90>] kernel_init+0x133/0x1b7 [ 1.117835] [<ffffffff81cc25c4>] ? do_early_param+0x86/0x86 [ 1.117835] [<ffffffff8171aff4>] kernel_thread_helper+0x4/0x10 [ 1.118401] [<ffffffff81cc2b5d>] ? start_kernel+0x33f/0x33f [ 1.119832] [<ffffffff8171aff0>] ? gs_change+0xb/0xb [ 1.120325] ------------[ cut here ]------------ [ 1.120835] WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xc1/0xf0() [ 1.121437] sysfs: cannot create duplicate filename '/kernel/slab/:t-0000016' [ 1.121831] Modules linked in: [ 1.122138] Pid: 1, comm: swapper/0 Tainted: G W 3.6.0-rc1+ #6 [ 1.122831] Call Trace: [ 1.123074] [<ffffffff81195ce1>] ? 
sysfs_add_one+0xc1/0xf0 [ 1.123833] [<ffffffff8103adfa>] warn_slowpath_common+0x7a/0xb0 [ 1.124405] [<ffffffff8103aed1>] warn_slowpath_fmt+0x41/0x50 [ 1.124832] [<ffffffff81195ce1>] sysfs_add_one+0xc1/0xf0 [ 1.125337] [<ffffffff81195eb3>] create_dir+0x73/0xd0 [ 1.125832] [<ffffffff81196221>] sysfs_create_dir+0x81/0xe0 [ 1.126363] [<ffffffff81273d3d>] kobject_add_internal+0x9d/0x210 [ 1.126832] [<ffffffff812742a3>] kobject_init_and_add+0x63/0x90 [ 1.127406] [<ffffffff81124c60>] sysfs_slab_add+0x80/0x210 [ 1.127832] [<ffffffff81100175>] kmem_cache_create+0xa5/0x250 [ 1.128384] [<ffffffff81cf24cd>] ? md_init+0x144/0x144 [ 1.128833] [<ffffffff81cf25b6>] local_init+0xa4/0x11b [ 1.129831] [<ffffffff81cf24e1>] dm_init+0x14/0x45 [ 1.130305] [<ffffffff810001ba>] do_one_initcall+0x3a/0x160 [ 1.130831] [<ffffffff81cc2c90>] kernel_init+0x133/0x1b7 [ 1.131351] [<ffffffff81cc25c4>] ? do_early_param+0x86/0x86 [ 1.131830] [<ffffffff8171aff4>] kernel_thread_helper+0x4/0x10 [ 1.132392] [<ffffffff81cc2b5d>] ? start_kernel+0x33f/0x33f [ 1.132830] [<ffffffff8171aff0>] ? gs_change+0xb/0xb [ 1.133315] ---[ end trace 2703540871c8fab7 ]--- [ 1.133830] ------------[ cut here ]------------ [ 1.134274] WARNING: at lib/kobject.c:196 kobject_add_internal+0x1f5/0x210() [ 1.134829] kobject_add_internal failed for :t-0000016 with -EEXIST, don't try to register things with the same name in the same directory. [ 1.135829] Modules linked in: [ 1.136135] Pid: 1, comm: swapper/0 Tainted: G W 3.6.0-rc1+ #6 [ 1.136828] Call Trace: [ 1.137071] [<ffffffff81273e95>] ? kobject_add_internal+0x1f5/0x210 [ 1.137830] [<ffffffff8103adfa>] warn_slowpath_common+0x7a/0xb0 [ 1.138402] [<ffffffff8103aed1>] warn_slowpath_fmt+0x41/0x50 [ 1.138830] [<ffffffff811955a3>] ? 
release_sysfs_dirent+0x73/0xf0 [ 1.139419] [<ffffffff81273e95>] kobject_add_internal+0x1f5/0x210 [ 1.139830] [<ffffffff812742a3>] kobject_init_and_add+0x63/0x90 [ 1.140429] [<ffffffff81124c60>] sysfs_slab_add+0x80/0x210 [ 1.140830] [<ffffffff81100175>] kmem_cache_create+0xa5/0x250 [ 1.141829] [<ffffffff81cf24cd>] ? md_init+0x144/0x144 [ 1.142307] [<ffffffff81cf25b6>] local_init+0xa4/0x11b [ 1.142829] [<ffffffff81cf24e1>] dm_init+0x14/0x45 [ 1.143307] [<ffffffff810001ba>] do_one_initcall+0x3a/0x160 [ 1.143829] [<ffffffff81cc2c90>] kernel_init+0x133/0x1b7 [ 1.144352] [<ffffffff81cc25c4>] ? do_early_param+0x86/0x86 [ 1.144829] [<ffffffff8171aff4>] kernel_thread_helper+0x4/0x10 [ 1.145405] [<ffffffff81cc2b5d>] ? start_kernel+0x33f/0x33f [ 1.145828] [<ffffffff8171aff0>] ? gs_change+0xb/0xb [ 1.146313] ---[ end trace 2703540871c8fab8 ]--- Conflicts: mm/slub.c Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
cce89f4f |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move kmem_cache refcounting to common code Get rid of the refcount stuff in the allocators and do that part of kmem_cache management in the common code. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
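As a rough illustration of what refcounting handled in one common place looks like, here is a minimal userspace sketch. The names `struct model_cache`, `model_cache_alias()` and `model_cache_destroy()` are hypothetical and chosen only for this example; the assumption is that every additional user of a cache takes a reference and that teardown happens only when the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cache descriptor with a shared reference count. */
struct model_cache {
	int refcount;     /* number of users, including the creator */
	bool shut_down;   /* set once the allocator-specific teardown ran */
};

/* An additional user (for example an alias) takes a reference. */
static void model_cache_alias(struct model_cache *c)
{
	assert(c != NULL && c->refcount > 0);
	c->refcount++;
}

/*
 * Common destroy path: drop one reference and only tear the cache
 * down when nobody else uses it. Returns true if teardown happened.
 */
static bool model_cache_destroy(struct model_cache *c)
{
	assert(c != NULL && c->refcount > 0);
	if (--c->refcount > 0)
		return false;
	c->shut_down = true;   /* stands in for the allocator's shutdown hook */
	return true;
}
```

With the counter kept in shared code, the allocator-specific part only has to supply the shutdown step.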
|
#
8a13a4cc |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists Do the initial settings of the fields in common code. This will allow us to push more processing into common code later and improve readability. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
278b1bb1 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move kmem_cache allocations into common code Shift the allocations to common code. That way the allocation and freeing of the kmem_cache structures is handled by common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
96d17b7b |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move sysfs_slab_add to common Simplify locking by moving the slab_add_sysfs after all locks have been dropped. Eases the upcoming move to provide sysfs support for all allocators. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
cbb79694 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Do slab aliasing call from common code The slab aliasing logic causes some strange contortions in slub. So add a call to deal with aliases to slab_common.c but disable it for other slab allocators by providing stubs that fail to create aliases. Full general support for aliases will require additional cleanup passes and more standardization of fields in kmem_cache. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
db265eca |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move duping of slab name to slab_common.c Duping of the slabname has to be done by each slab. Moving this code to slab_common avoids duplicate implementations. With this patch we have common string handling for all slab allocators. Strings passed to kmem_cache_create() are copied internally. Subsystems can create temporary strings to create slab caches. Slabs allocated in early states of bootstrap will never be freed (and those can never be freed since they are essential to slab allocator operations). During bootstrap we therefore do not have to worry about duping names. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
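The practical effect of copying the name internally can be shown with a small userspace sketch. `struct model_cache`, `model_cache_create()` and `model_cache_destroy()` are hypothetical stand-ins invented for this example, and `malloc()`/`memcpy()` replace whatever string duplication helper the kernel uses. The point is only that the caller's buffer may be temporary.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a cache descriptor that owns its name. */
struct model_cache {
	char *name;          /* private copy, freed on destroy */
	size_t object_size;
};

/* Create a cache and duplicate the name so the caller's string can go away. */
static struct model_cache *model_cache_create(const char *name, size_t object_size)
{
	assert(name != NULL);

	struct model_cache *c = malloc(sizeof(*c));
	if (c == NULL)
		return NULL;

	size_t len = strlen(name) + 1;   /* include the terminating NUL */
	c->name = malloc(len);
	if (c->name == NULL) {
		free(c);
		return NULL;
	}
	memcpy(c->name, name, len);
	c->object_size = object_size;
	return c;
}

/* Release the private name copy together with the descriptor. */
static void model_cache_destroy(struct model_cache *c)
{
	if (c == NULL)
		return;
	free(c->name);
	free(c);
}
```

A caller can build the name in a stack buffer, create the cache, and then reuse or overwrite the buffer without affecting the stored name.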
|
#
12c3667f |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Get rid of __kmem_cache_destroy What is done there can be done in __kmem_cache_shutdown. This affects RCU handling somewhat. On rcu free all slab allocators do not refer to other management structures than the kmem_cache structure. Therefore these other structures can be freed before the rcu deferred free to the page allocator occurs. Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8f4c765c |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move freeing of kmem_cache structure to common code The freeing action is basically the same in all slab allocators. Move to the common kmem_cache_destroy() function. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
9b030cb8 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Use "kmem_cache" name for slab cache with kmem_cache struct Make all allocators use the "kmem_cache" slabname for the "kmem_cache" structure. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
945cf2b6 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Extract a common function for kmem_cache_destroy kmem_cache_destroy does basically the same in all allocators. Extract common code which is easy since we already have common mutex handling. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
7c9adf5a |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/sl[aou]b: Move list_add() to slab_common.c Move the code to append the new kmem_cache to the list of slab caches to the kmem_cache_create code in the shared code. This is possible now since the acquisition of the mutex was moved into kmem_cache_create(). Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
208c4358 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/slub: Use kmem_cache for the kmem_cache structure Do not use kmalloc() but kmem_cache_alloc() for the allocation of the kmem_cache structures in slub. Reviewed-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
79576102 |
|
04-Sep-2012 |
Christoph Lameter <cl@linux.com> |
mm/slub: Add debugging to verify correct cache use on kmem_cache_free() Add additional debugging to check that the object is actually from the cache the caller claims. Doing so currently trips up some other debugging code. It takes a lot to infer from that what was happening. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> [ penberg@kernel.org: Use pr_err() ] Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
e24fc410 |
|
22-Jun-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: reduce failure of this_cpu_cmpxchg in put_cpu_partial() after unfreezing In the current implementation, after unfreezing, we don't touch oldpage, so it remains non-NULL. When we then call this_cpu_cmpxchg() with this stale oldpage, this_cpu_cmpxchg() mostly fails. We can set oldpage to NULL after unfreezing, because unfreeze_partial() ensures that all the cpu partial slabs are removed from the cpu partial list. After that, we can expect this_cpu_cmpxchg() to mostly succeed. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
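Why a stale expected value makes the compare-and-exchange fail can be reproduced in a single-threaded userspace model. This is a sketch under assumptions: C11 `atomic_compare_exchange_strong()` stands in for this_cpu_cmpxchg(), the names `model_page`, `cpu_partial`, `model_unfreeze_partial()` and `model_put_cpu_partial()` are hypothetical, and the object accounting is simplified to one unit per page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page on the per-cpu partial list. */
struct model_page {
	int pobjects;              /* simplified running count for the list */
	struct model_page *next;
};

/* Models the head of the per-cpu partial list. */
static _Atomic(struct model_page *) cpu_partial;

/* Models unfreeze_partial(): afterwards the per-cpu list is empty. */
static void model_unfreeze_partial(void)
{
	atomic_store(&cpu_partial, NULL);
}

/*
 * Push 'page' onto the list. If the list is over 'limit', drain it first.
 * 'reset_oldpage' selects the behaviour after the drain: with it, the
 * expected value is set to NULL to match the now empty list.
 * Returns the number of compare-and-exchange attempts that were needed.
 */
static int model_put_cpu_partial(struct model_page *page, int limit,
				 bool reset_oldpage)
{
	struct model_page *oldpage;
	int attempts = 0;

	assert(page != NULL);
	do {
		int pobjects = 0;

		oldpage = atomic_load(&cpu_partial);
		if (oldpage != NULL) {
			pobjects = oldpage->pobjects;
			if (pobjects > limit) {
				model_unfreeze_partial();
				if (reset_oldpage)
					oldpage = NULL;   /* list is empty now */
				pobjects = 0;
			}
		}
		page->pobjects = pobjects + 1;
		page->next = oldpage;
		attempts++;
	} while (!atomic_compare_exchange_strong(&cpu_partial, &oldpage, page));

	return attempts;
}
```

In this model, a call that drains the list without resetting the expected value needs a second pass through the loop, while the variant that resets it to NULL succeeds on the first attempt.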
|
#
19c7ff9e |
|
29-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Take node lock during object free checks Only applies to scenarios where debugging is on: Validation of slabs can currently occur while debugging information is updated from the fast paths of the allocator. This results in various races where we get false reports about slab metadata not being in order. This patch makes the fast paths take the node lock so that serialization with slab validation will occur. Causes additional slowdown in debug scenarios. Reported-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d9b7f226 |
|
03-Aug-2012 |
Glauber Costa <glommer@parallels.com> |
slub: use free_page instead of put_page for freeing kmalloc allocation When freeing objects, the slub allocator will most of the time free empty pages by calling __free_pages(). But high-order kmalloc will be disposed of by means of put_page() instead. It makes no sense to call put_page() on kernel pages that are provided by the object allocators, so we shouldn't be doing this ourselves. Aside from the consistency change, we don't change the flow too much. put_page() would call its dtor function, which is __free_pages. We also already do all of the Compound page tests ourselves, and the Mlock test we lose doesn't really matter. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> CC: Pekka Enberg <penberg@kernel.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
5091b74a |
|
31-Jul-2012 |
Christoph Lameter <cl@linux.com> |
mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks This patch removes the check for pfmemalloc from the alloc hotpath and puts the logic after the election of a new per cpu slab. For a pfmemalloc page we do not use the fast path but force the use of the slow path which is also used for the debug case. This has the side-effect of weakening pfmemalloc processing in the following way: 1. A process that is allocating for network swap calls __slab_alloc. pfmemalloc_match is true so the freelist is loaded and c->freelist is now pointing to a pfmemalloc page. 2. A process that is attempting normal allocations calls slab_alloc, finds the pfmemalloc page on the freelist and uses it because it did not check pfmemalloc_match() The patch allows non-pfmemalloc allocations to use pfmemalloc pages with the kmalloc slabs being the most vulnerable caches on the grounds they are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A later patch will still protect the system as processes will get throttled if the pfmemalloc reserves get depleted but performance will not degrade as smoothly. [mgorman@suse.de: Expanded changelog] Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
072bb0aa |
|
31-Jul-2012 |
Mel Gorman <mgorman@suse.de> |
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. 
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allow network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propagate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them was expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. 
For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel ran to completion. However, the patched kernel completed the test 45% faster.
MICRO
                                          3.5.0-rc2    3.5.0-rc2
                                            vanilla      swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)              197.80       173.07
User+Sys Time Running Test (seconds)         206.96       182.03
Total Elapsed Time (seconds)                3240.70      1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were allocated from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to be able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. 
This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
737b719e |
|
09-Jul-2012 |
David Rientjes <rientjes@google.com> |
mm, slub: ensure irqs are enabled for kmemcheck kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable them until after it is called for __GFP_WAIT allocations. This fixes a warning for such allocations: WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0() Acked-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
20cea968 |
|
06-Jul-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[aou]b: Move kmem_cache_create mutex handling to common code Move the mutex handling into the common kmem_cache_create() function. Then we can also move more checks out of SLAB's kmem_cache_create() into the common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
18004c5d |
|
06-Jul-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[aou]b: Use a common mutex definition Use the mutex definition from SLAB and make it the common way to take a sleeping lock. This has the effect of using a mutex instead of a rw semaphore for SLUB. SLOB gains the use of a mutex for kmem_cache_create serialization. Not needed now but SLOB may acquire some more features later (like slabinfo / sysfs support) through the expansion of the common code that will need this. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
97d06609 |
|
06-Jul-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[aou]b: Common definition for boot state of the slab allocators All allocators have some sort of support for the bootstrap status. Set up a common definition for the boot states and make all slab allocators use that definition. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
039363f3 |
|
06-Jul-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[aou]b: Extract common code for kmem_cache_create() Kmem_cache_create() does a variety of sanity checks but those vary depending on the allocator. Use the strictest tests and put them into a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM. This patch has the effect of adding sanity checks for SLUB and SLOB under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
068ce415 |
|
08-Jul-2012 |
Julia Lawall <Julia.Lawall@lip6.fr> |
slub: remove invalid reference to list iterator variable If list_for_each_entry, etc complete a traversal of the list, the iterator variable ends up pointing to an address at an offset from the list head, and not a meaningful structure. Thus this value should not be used after the end of the iterator. The patch replaces s->name by al->name, which is referenced nearby. This problem was found using Coccinelle (http://coccinelle.lip6.fr/). Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
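The pitfall fixed in 068ce415 can be reproduced in userspace. The sketch below uses a minimal hand-written circular list; `sk_container_of`, `sk_list_add_tail`, `sk_find_alias`, and `struct alias` are illustrative stand-ins, not the kernel's list API. When the loop runs to completion, the cursor equals the list head, so converting it to an entry would yield a pointer at an offset from the head rather than a real structure. The safe pattern is to use the entry only inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its list member. */
#define sk_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct alias {
    const char *name;
    struct list_head list;
};

static void sk_list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Return the entry named `name`, or NULL when the list has no match. */
static struct alias *sk_find_alias(struct list_head *head, const char *name)
{
    struct list_head *pos;

    for (pos = head->next; pos != head; pos = pos->next) {
        struct alias *al = sk_container_of(pos, struct alias, list);

        if (strcmp(al->name, name) == 0)
            return al;        /* valid: al refers to a real entry */
    }
    /*
     * Traversal completed: pos == head here. Applying sk_container_of()
     * to it would not give a meaningful struct alias, so report
     * "not found" instead of dereferencing the iterator.
     */
    return NULL;
}
```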
#
43d77867 |
|
08-Jun-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: refactoring unfreeze_partials() The current implementation of unfreeze_partials() is complicated, but the benefit from it is insignificant. In addition, the amount of code inside the do {} while loop has a bad influence on the fail rate of cmpxchg_double_slab. Under the current implementation, which tests the status of the cpu partial slab and acquires list_lock inside the do {} while loop, we don't need to acquire list_lock and gain a little benefit when the front of the cpu partial slab is to be discarded, but this is a rare case. In case add_partial is performed and cmpxchg_double_slab fails, remove_partial has to be called case by case. I think that these are disadvantages of the current implementation, so I refactor unfreeze_partials(). Minimizing the code in the do {} while loop reduces the fail rate of cmpxchg_double_slab. Below is the output of 'slabinfo -r kmalloc-256' when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.

** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos   182685
Unlocked Cmpxchg Double redos 0

** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos   177995
Unlocked Cmpxchg Double redos 1

We can see the cmpxchg_double_slab fail rate is improved slightly. Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
 Performance counter stats for './hackbench 50 process 4000' (30 runs):

     108517.190463 task-clock                #    7.926 CPUs utilized            ( +- 0.24% )
         2,919,550 context-switches          #    0.027 M/sec                    ( +- 3.07% )
           100,774 CPU-migrations            #    0.929 K/sec                    ( +- 4.72% )
           124,201 page-faults               #    0.001 M/sec                    ( +- 0.15% )
   401,500,234,387 cycles                    #    3.700 GHz                      ( +- 0.24% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
   250,576,913,354 instructions              #    0.62  insns per cycle          ( +- 0.13% )
    45,934,956,860 branches                  #  423.297 M/sec                    ( +- 0.14% )
       188,219,787 branch-misses             #    0.41% of all branches          ( +- 0.56% )

      13.691837307 seconds time elapsed                                          ( +- 0.24% )

** after **
 Performance counter stats for './hackbench 50 process 4000' (30 runs):

     107784.479767 task-clock                #    7.928 CPUs utilized            ( +- 0.22% )
         2,834,781 context-switches          #    0.026 M/sec                    ( +- 2.33% )
            93,083 CPU-migrations            #    0.864 K/sec                    ( +- 3.45% )
           123,967 page-faults               #    0.001 M/sec                    ( +- 0.15% )
   398,781,421,836 cycles                    #    3.700 GHz                      ( +- 0.22% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
   250,189,160,419 instructions              #    0.63  insns per cycle          ( +- 0.09% )
    45,855,370,128 branches                  #  425.436 M/sec                    ( +- 0.10% )
       169,881,248 branch-misses             #    0.37% of all branches          ( +- 0.43% )

      13.596272341 seconds time elapsed                                          ( +- 0.22% )

No regression is found, but rather we can see slightly better result. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
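The idea behind 43d77867, keeping the body of a compare-and-exchange retry loop as small as possible so that fewer concurrent updates can slip in between the read and the exchange, can be illustrated with C11 atomics. This is a sketch under assumptions: a single 32-bit state word with a hypothetical `SK_FROZEN` bit stands in for the slab's freelist/counters pair, and `sk_unfreeze` is not a kernel function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_FROZEN 0x1u        /* hypothetical "frozen" bit in the state word */

/*
 * Clear the frozen bit with a compare-and-exchange loop and return the
 * number of redos. The loop body only computes the new value; any other
 * work (list manipulation, statistics) belongs outside the loop.
 */
static unsigned int sk_unfreeze(_Atomic uint32_t *state)
{
    unsigned int redos = 0;
    uint32_t old = atomic_load(state);

    for (;;) {
        uint32_t new_val = old & ~SK_FROZEN;

        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(state, &old, new_val))
            return redos;
        redos++;
    }
}
```

The returned count plays the role of the "Cmpxchg Double redos" statistic quoted in the message: the shorter the window between load and exchange, the lower it tends to be under contention.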
#
d24ac77f |
|
18-May-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: use __cmpxchg_double_slab() at interrupt disabled place get_freelist(), unfreeze_partials() are only called with interrupt disabled, so __cmpxchg_double_slab() is suitable. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
e7b691b0 |
|
09-Jun-2012 |
Andi Kleen <ak@linux.intel.com> |
slab/mempolicy: always use local policy from interrupt context slab_node() could access current->mempolicy from interrupt context. However there's a race condition during exit where the mempolicy is first freed and then the pointer zeroed. Using this from interrupts seems bogus anyway. The interrupt will interrupt a random process and therefore get a random mempolicy. Many times, this will be idle's, which no one can change. Just disable this here and always use local for slab from interrupts. I also cleaned up the callers of slab_node a bit which always passed the same argument. I believe the original mempolicy code did that in fact, so it's likely a regression. v2: send version with correct logic v3: simplify. fix typo. Reported-by: Arun Sharma <asharma@fb.com> Cc: penberg@kernel.org Cc: cl@linux.com Signed-off-by: Andi Kleen <ak@linux.intel.com> [tdmackey@twitter.com: Rework control flow based on feedback from cl@linux.com, fix logic, and cleanup current task_struct reference] Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Mackey <tdmackey@twitter.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
3b0efdfa |
|
13-Jun-2012 |
Christoph Lameter <cl@linux.com> |
mm, sl[aou]b: Extract common fields from struct kmem_cache Define a struct that describes common fields used in all slab allocators. A slab allocator either uses the common definition (like SLOB) or is required to provide members of kmem_cache with the definition given. After that it will be possible to share code that only operates on those fields of kmem_cache. The patch basically takes the slob definition of kmem cache and uses the field names for the other allocators. It also standardizes the names used for basic object lengths in allocators:

object_size  Struct size specified at kmem_cache_create. Basically the payload expected to be used by the subsystem.

size         The size of memory allocated for each object. This size is larger than object_size and includes padding, alignment and extra metadata for each object (e.g. for debugging and rcu).

Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
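The relationship between the two lengths named in 3b0efdfa can be written as a small formula: the per-object slot is the payload plus any per-object metadata, rounded up to the cache alignment. The helper below is a hypothetical sketch; `sk_slot_size`, `SK_ALIGN`, and the single `metadata` parameter are illustrative, and the real allocators compute their layout with more cases than this.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a (a must be non-zero). */
#define SK_ALIGN(x, a) ((((x) + (a) - 1) / (a)) * (a))

/*
 * object_size: payload size requested at cache creation.
 * metadata:    extra per-object bytes (debug info, rcu head, ...).
 * align:       required object alignment.
 * Returns the "size" of one slot in the slab.
 */
static size_t sk_slot_size(size_t object_size, size_t metadata, size_t align)
{
    return SK_ALIGN(object_size + metadata, align);
}
```

For instance, a 100-byte payload with no metadata and 8-byte alignment occupies a 104-byte slot, so `size` exceeds `object_size` by the padding alone.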
#
57d437d2 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: pass page to node_match() instead of kmem_cache_cpu structure Avoid passing the kmem_cache_cpu pointer to node_match. This makes the node_match function more generic and easier to understand. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f6e7def7 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Use page variable instead of c->page. Store the value of c->page to avoid additional fetches from per cpu data. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c17dda40 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Separate out kmem_cache_cpu processing from deactivate_slab Processing on fields of kmem_cache_cpu is cleaner if code working on fields of this struct is taken out of deactivate_slab(). Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ec3ab083 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Get rid of the node field The node field is always page_to_nid(c->page). So it's rather easy to replace. Note that there may be slightly more overhead in various hot paths due to the need to shift the bits from page->flags. However, that is mostly compensated for by a smaller footprint of the kmem_cache_cpu structure (this patch reduces that to 3 words per cache) which allows better caching. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
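The "shift the bits from page->flags" cost mentioned in ec3ab083 refers to decoding the node id from the flags word instead of reading a cached field. A userspace sketch of that decoding follows; the 6-bit field at the top of a 64-bit word is an assumed layout chosen for illustration, and the `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: node id in the top 6 bits of flags. */
#define SK_NODES_WIDTH 6
#define SK_NODES_SHIFT (64 - SK_NODES_WIDTH)
#define SK_NODES_MASK  ((UINT64_C(1) << SK_NODES_WIDTH) - 1)

struct sk_page {
    uint64_t flags;
};

/* Decode the node id: one shift and one mask on every lookup. */
static int sk_page_to_nid(const struct sk_page *page)
{
    return (int)((page->flags >> SK_NODES_SHIFT) & SK_NODES_MASK);
}

/* Encode a node id without disturbing the other flag bits. */
static void sk_set_page_node(struct sk_page *page, int nid)
{
    page->flags &= ~(SK_NODES_MASK << SK_NODES_SHIFT);
    page->flags |= ((uint64_t)nid & SK_NODES_MASK) << SK_NODES_SHIFT;
}
```

The trade-off described in the message is this extra shift-and-mask per lookup against one fewer word in the per-cpu structure.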
#
188fd063 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: new_slab_objects() can also get objects from partial list Moving the attempt to get a slab page from the partial lists simplifies __slab_alloc which is rather complicated. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f4697436 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Simplify control flow in __slab_alloc() Simplify control flow a bit avoiding nesting. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
7ced3719 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Acquire_slab() avoid loop Avoid the loop in acquire slab and simply fail if there is a conflict. This will cause the next page on the list to be considered. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
507effea |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Add frozen check in __slab_alloc Verify that objects returned from __slab_alloc come from slab pages in the correct state. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
6faa6833 |
|
09-May-2012 |
Christoph Lameter <cl@linux.com> |
slub: Use freelist instead of "object" in __slab_alloc The variable "object" really refers to a list of objects that we are handling. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c03f94cc |
|
17-May-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: use __SetPageSlab function to set PG_slab flag To set page-flag, using SetPageXXXX() and __SetPageXXXX() is more understandable and maintainable. So change it. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
02d7633f |
|
16-May-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: fix a memory leak in get_partial_node() In the following case,

1. acquire slab for cpu partial list
2. free object to it by remote cpu
3. page->freelist = t

a memory leak occurs. Change acquire_slab() not to zap freelist when it works for cpu partial list. I think it is a sufficient solution for fixing a memory leak. Below is the output of 'slabinfo -r kmalloc-256' when './perf stat -r 30 hackbench 50 process 4000 > /dev/null' is done.

***Vanilla***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     468   Sanity Checks : Off  Total: 3833856
SlabObj:     256  Full   :     111   Redzoning     : Off  Used : 2004992
SlabSiz:    8192  Partial:     302   Poisoning     : Off  Loss : 1828864
Loss   :       0  CpuSlab:      55   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

***Patched***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     300   Sanity Checks : Off  Total: 2457600
SlabObj:     256  Full   :     204   Redzoning     : Off  Used : 2348800
SlabSiz:    8192  Partial:      33   Poisoning     : Off  Loss :  108800
Loss   :       0  CpuSlab:      63   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

The Total and Loss numbers show the impact of this patch. Cc: <stable@vger.kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
02e1a9cd |
|
17-May-2012 |
majianpeng <majianpeng@gmail.com> |
slub: missing test for partial pages flush work in flush_all() I found some kernel messages such as:

SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects.
Pid: 6143, comm: mdadm Tainted: G O 3.4.0-rc6+ #75
Call Trace:
kmem_cache_destroy+0x328/0x400
free_conf+0x2d/0xf0 [raid456]
stop+0x41/0x60 [raid456]
md_stop+0x1a/0x60 [md_mod]
do_md_stop+0x74/0x470 [md_mod]
md_ioctl+0xff/0x11f0 [md_mod]
blkdev_ioctl+0xd8/0x7a0
block_ioctl+0x3b/0x40
do_vfs_ioctl+0x96/0x560
sys_ioctl+0x91/0xa0
system_call_fastpath+0x16/0x1b

Then using kmemleak I found these messages:

unreferenced object 0xffff8800b6db7380 (size 112):
  comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s)
  hex dump (first 32 bytes):
    01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff  .....N..........
    ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff  .........@J.....
  backtrace:
    kmemleak_alloc+0x21/0x50
    kmem_cache_alloc+0xeb/0x1b0
    kmem_cache_open+0x2f1/0x430
    kmem_cache_create+0x158/0x320
    setup_conf+0x649/0x770 [raid456]
    run+0x68b/0x840 [raid456]
    md_run+0x529/0x940 [md_mod]
    do_md_run+0x18/0xc0 [md_mod]
    md_ioctl+0xba8/0x11f0 [md_mod]
    blkdev_ioctl+0xd8/0x7a0
    block_ioctl+0x3b/0x40
    do_vfs_ioctl+0x96/0x560
    sys_ioctl+0x91/0xa0
    system_call_fastpath+0x16/0x1b

This bug was introduced by commit a8364d5555b ("slub: only IPI CPUs that have per cpu obj to flush"), which did not include checks for per cpu partial pages being present on a cpu. Signed-off-by: majianpeng <majianpeng@gmail.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
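The check that 02e1a9cd describes as missing can be sketched as a predicate that treats a CPU as needing a flush when it holds either a current slab or a per-cpu partial list, combined with the "only interrupt CPUs that have something to flush" mask from a8364d55. Everything here is a simplified model: `struct sk_cpu_cache`, `sk_has_cpu_slab`, `sk_flush_mask`, and the fixed CPU count are assumptions, and a plain bitmask stands in for the kernel's cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NR_CPUS 8          /* hypothetical fixed CPU count */

struct sk_cpu_cache {
    void *page;               /* current per-cpu slab, if any */
    void *partial;            /* per-cpu list of partial slabs, if any */
};

/* A CPU needs flushing if it holds a slab OR a partial list. */
static bool sk_has_cpu_slab(const struct sk_cpu_cache *c)
{
    return c->page != NULL || c->partial != NULL;
}

/* Build a mask with bit n set when CPU n should be asked to flush. */
static uint32_t sk_flush_mask(const struct sk_cpu_cache caches[SK_NR_CPUS])
{
    uint32_t mask = 0;

    for (int cpu = 0; cpu < SK_NR_CPUS; cpu++) {
        if (sk_has_cpu_slab(&caches[cpu]))
            mask |= UINT32_C(1) << cpu;
    }
    return mask;
}
```

Testing only `page` in the predicate would skip a CPU whose sole holdings are partial slabs, leaving those objects behind when the cache is destroyed, which matches the symptom reported in the message.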
#
4053497d |
|
10-May-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: remove unused argument of init_kmem_cache_node() We don't use the argument since commit 3b89d7d881a1dbb4da158f7eb5d6b3ceefc72810 ('slub: move min_partial to struct kmem_cache'), so remove it Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
601d39d0 |
|
10-May-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: fix a possible memory leak Memory allocated by kstrdup should be freed when kmalloc(kmem_size, GFP_KERNEL) fails. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
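The error-path pattern fixed in 601d39d0 (release the duplicated name when the second allocation fails) looks like this in userspace C. The allocator wrappers with a failure counter are test scaffolding invented for this sketch, and `sk_cache_create` is a hypothetical stand-in, not the kernel function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Scaffolding (hypothetical): count live allocations and optionally
 * fail the Nth call, so the error path can be exercised.
 */
static int sk_live, sk_calls, sk_fail_at;

static void *sk_alloc(size_t n)
{
    if (++sk_calls == sk_fail_at)
        return NULL;
    void *p = malloc(n);
    if (p)
        sk_live++;
    return p;
}

static void sk_free(void *p)
{
    if (p) {
        sk_live--;
        free(p);
    }
}

struct sk_cache {
    char *name;
    size_t size;
};

static struct sk_cache *sk_cache_create(const char *name, size_t size)
{
    size_t len = strlen(name) + 1;
    char *n = sk_alloc(len);                    /* stands in for kstrdup() */
    if (!n)
        return NULL;
    memcpy(n, name, len);

    struct sk_cache *s = sk_alloc(sizeof(*s));  /* stands in for kmalloc() */
    if (!s) {
        sk_free(n);           /* without this, the name copy leaks */
        return NULL;
    }
    s->name = n;
    s->size = size;
    return s;
}

static void sk_cache_destroy(struct sk_cache *s)
{
    sk_free(s->name);
    sk_free(s);
}
```

Setting `sk_fail_at = 2` forces the second allocation to fail; with the cleanup in place, `sk_live` returns to zero afterwards.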
#
de3ec035 |
|
27-Jan-2012 |
Joonsoo Kim <js1304@gmail.com> |
slub: fix incorrect return type of get_any_partial() Commit 497b66f2ecc97844493e6a147fd5a7e73f73f408 ('slub: return object pointer from get_partial() / new_slab().') changed return type of some functions. This updates missing part. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a8364d55 |
|
28-Mar-2012 |
Gilad Ben-Yossef <gilad@benyossef.com> |
slub: only IPI CPUs that have per cpu obj to flush flush_all() is called for each kmem_cache_destroy(). So every cache being destroyed dynamically ends up sending an IPI to each CPU in the system, regardless of whether the cache has ever been used there. For example, if you close the InfiniBand ipath driver char device file, the close file ops calls kmem_cache_destroy(). So running some InfiniBand config tool on a single CPU dedicated to system tasks might interrupt the rest of the 127 CPUs dedicated to some CPU intensive or latency sensitive task. I suspect there is a good chance that every line in the output of "git grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario. This patch attempts to rectify this issue by sending an IPI to flush the per cpu objects back to the free lists only to CPUs that seem to have such objects. The check which CPU to IPI is racy but we don't care since asking a CPU without per cpu objects to flush does no damage and as far as I can tell the flush_all by itself is racy against allocs on remote CPUs anyway, so if you required the flush_all to be deterministic, you had to arrange for locking regardless. Without this patch the following artificial test case:

$ cd /sys/kernel/slab
$ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none. The code path of memory allocation failure for CPUMASK_OFFSTACK=y config was tested using fault injection framework.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Sasha Levin <levinsasha928@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Avi Kivity <avi@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.org> Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com> Cc: Milton Miller <miltonm@bga.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cc9a6c87 |
|
21-Mar-2012 |
Mel Gorman <mgorman@suse.de> |
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating a large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. 
The actual results were

                              3.3.0-rc3            3.3.0-rc3
                            rc3-vanilla       nobarrier-v2r1
Clients 1 UserTime          0.07 ( 0.00%)        0.08 (-14.19%)
Clients 2 UserTime          0.07 ( 0.00%)        0.07 ( 2.72%)
Clients 4 UserTime          0.08 ( 0.00%)        0.07 ( 3.29%)
Clients 1 SysTime           0.70 ( 0.00%)        0.65 ( 6.65%)
Clients 2 SysTime           0.85 ( 0.00%)        0.82 ( 3.65%)
Clients 4 SysTime           1.41 ( 0.00%)        1.41 ( 0.32%)
Clients 1 WallTime          0.77 ( 0.00%)        0.74 ( 4.19%)
Clients 2 WallTime          0.47 ( 0.00%)        0.45 ( 3.73%)
Clients 4 WallTime          0.38 ( 0.00%)        0.37 ( 1.58%)
Clients 1 Flt/sec/cpu  497620.28 ( 0.00%)   520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu  414639.05 ( 0.00%)   429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu  257959.16 ( 0.00%)   258761.48 ( 0.31%)
Clients 1 Flt/sec      495161.39 ( 0.00%)   517292.87 ( 4.47%)
Clients 2 Flt/sec      820325.95 ( 0.00%)   850289.77 ( 3.65%)
Clients 4 Flt/sec     1020068.93 ( 0.00%)  1022674.06 ( 0.26%)

MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68     132.17
User+Sys Time Running Test (seconds)         164.2     160.13
Total Elapsed Time (seconds)                123.46     120.87

The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. 
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
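The sequence-counter approach described in cc9a6c87 follows a general pattern: the writer bumps a counter before and after an update, and readers retry if the counter changed while they were reading. The sketch below shows that pattern with C11 atomics. It is not the kernel's seqcount API; the `sk_` names are hypothetical, a single writer is assumed, and it uses sequentially consistent atomics for simplicity, whereas the point of the patch is that the real read side needs only cheap read barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct sk_mems {
    atomic_uint seq;              /* odd while an update is in progress */
    _Atomic uint64_t nodemask;    /* the data being protected */
};

/* Wait out any in-progress update and return the sequence snapshot. */
static unsigned int sk_read_begin(struct sk_mems *m)
{
    unsigned int s;

    while ((s = atomic_load(&m->seq)) & 1u)
        ;                         /* writer active: spin briefly */
    return s;
}

/* True if an update happened since sk_read_begin() returned `start`. */
static bool sk_read_retry(struct sk_mems *m, unsigned int start)
{
    return atomic_load(&m->seq) != start;
}

/* Single writer assumed: bump, update, bump. */
static void sk_write_nodemask(struct sk_mems *m, uint64_t mask)
{
    atomic_fetch_add(&m->seq, 1); /* counter becomes odd */
    atomic_store(&m->nodemask, mask);
    atomic_fetch_add(&m->seq, 1); /* counter becomes even again */
}

/* Read side: repeat until a consistent snapshot was observed. */
static uint64_t sk_read_nodemask(struct sk_mems *m)
{
    uint64_t mask;
    unsigned int s;

    do {
        s = sk_read_begin(m);
        mask = atomic_load(&m->nodemask);
    } while (sk_read_retry(m, s));
    return mask;
}
```

In the commit's setting the retry decision wraps a whole allocation attempt rather than a single load, so a failure seen while the nodemask was changing can be retried instead of being reported.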
#
8028dcea |
|
03-Feb-2012 |
Alex Shi <alex.shi@linux.alibaba.com> |
slub: per cpu partial statistics change This patch splits cpu_partial_free into 2 parts: cpu_partial_node, which counts PCP refills from the node partial list; and cpu_partial_free (same name as before), which counts PCP refills in the slab_free slow path. A new statistic 'cpu_partial_drain' is added to count PCP drains to the node partial list. This information is useful when tuning PCP. The slabinfo.c code is unchanged, since cpu_partial_node is not on the slow path. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4de900b4 |
|
30-Jan-2012 |
Christoph Lameter <cl@linux.com> |
slub: include include for prefetch Otherwise m68k breaks: On Mon, 30 Jan 2012, Geert Uytterhoeven wrote: > m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/ > > mm/slub.c:274: error: implicit declaration of function 'prefetch' > > Sorry, didn't notice it earlier due to other build breakage in -next. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
66c4c35c |
|
17-Jan-2012 |
Christoph Lameter <cl@linux.com> |
slub: Do not hold slub_lock when calling sysfs_slab_add() sysfs_slab_add() calls various sysfs functions that actually may end up in userspace doing all sorts of things. Release the slub_lock after adding the kmem_cache structure to the list. At that point the address of the kmem_cache is not known so we are guaranteed exclusive access to the following modifications to the kmem_cache structure. If the sysfs_slab_add fails then reacquire the slub_lock to remove the kmem_cache structure from the list. Cc: <stable@vger.kernel.org> # 3.3+ Reported-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
0ad9500e |
|
16-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: prefetch next freelist pointer in slab_alloc() Recycling a page is a problem, since freelist link chain is hot on cpu(s) which freed objects, and possibly very cold on cpu currently owning slab. Adding a prefetch of cache line containing the pointer to next object in slab_alloc() helps a lot in many workloads, in particular on asymmetric ones (allocations done on one cpu, frees on other cpus). Added cost is three machine instructions only. Examples on my dual socket quad core ht machine (Intel CPU E5540 @2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.

Before patch :

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     327577,471718 task-clock                #   15,821 CPUs utilized            ( +- 0,64% )
        28 866 491 context-switches          #    0,088 M/sec                    ( +- 1,80% )
         1 506 929 CPU-migrations            #    0,005 M/sec                    ( +- 3,24% )
           127 151 page-faults               #    0,000 M/sec                    ( +- 0,16% )
   829 399 813 448 cycles                    #    2,532 GHz                      ( +- 0,64% )
   580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +- 0,71% )
   197 431 700 448 stalled-cycles-backend    #   23,80% backend cycles idle      ( +- 1,03% )
   503 548 648 975 instructions              #    0,61  insns per cycle
                                             #    1,15  stalled cycles per insn  ( +- 0,46% )
    95 780 068 471 branches                  #  292,389 M/sec                    ( +- 0,48% )
     1 426 407 916 branch-misses             #    1,49% of all branches          ( +- 1,35% )

      20,705679994 seconds time elapsed                                          ( +- 0,64% )

After patch :

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     286236,542804 task-clock                #   15,786 CPUs utilized            ( +- 1,32% )
        19 703 372 context-switches          #    0,069 M/sec                    ( +- 4,99% )
         1 658 249 CPU-migrations            #    0,006 M/sec                    ( +- 6,62% )
           126 776 page-faults               #    0,000 M/sec                    ( +- 0,12% )
   724 636 593 213 cycles                    #    2,532 GHz                      ( +- 1,32% )
   499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +- 1,47% )
   156 555 126 809 stalled-cycles-backend    #   21,60% backend cycles idle      ( +- 2,22% )
   463 897 792 661 instructions              #    0,64  insns per cycle
                                             #    1,08  stalled cycles per insn  ( +- 0,94% )
    87 717 352 563 branches                  #  306,451 M/sec                    ( +- 0,99% )
       941 738 280 branch-misses             #    1,07% of all branches          ( +- 3,35% )

      18,132070670 seconds time elapsed                                          ( +- 1,30% )

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> CC: Matt Mackall <mpm@selenic.com> CC: David Rientjes <rientjes@google.com> CC: "Alex,Shi" <alex.shi@intel.com> CC: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
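The technique from 0ad9500e, prefetching the cache line that holds the next object's free pointer while handing out the current object, can be sketched on a simple singly linked freelist. Assumptions in this sketch: the free pointer sits at offset 0 of each free object, `sk_alloc_fast` and `struct sk_object` are hypothetical names, and `__builtin_prefetch` is the GCC/Clang builtin rather than the kernel's `prefetch()` helper.

```c
#include <assert.h>
#include <stddef.h>

/* A free object stores the pointer to the next free object at offset 0. */
struct sk_object {
    struct sk_object *next;
};

/*
 * Pop one object from the freelist. Before returning, issue a prefetch
 * for the next object's link so that the following allocation is less
 * likely to stall on a cold cache line.
 */
static void *sk_alloc_fast(struct sk_object **freelist)
{
    struct sk_object *obj = *freelist;

    if (!obj)
        return NULL;

    struct sk_object *next = obj->next;

    *freelist = next;
    if (next)
        __builtin_prefetch(next, 0 /* read */, 3 /* keep in cache */);
    return obj;
}
```

The prefetch is only a hint and does not change what the function returns; its effect shows up as timing, as in the hackbench figures quoted in the message.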
#
2565409f |
|
12-Jan-2012 |
Heiko Carstens <hca@linux.ibm.com> |
mm,x86,um: move CMPXCHG_DOUBLE config option Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures can simply select the option if it is supported. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
43570fd2 |
|
12-Jan-2012 |
Heiko Carstens <hca@linux.ibm.com> |
mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL While implementing cmpxchg_double() on s390 I realized that we don't set CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it. However setting that option will increase the size of struct page by eight bytes on 64 bit, which we certainly do not want. Also, it doesn't make sense that a present cpu feature should increase the size of struct page. Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and that it should depend on CMPXCHG_DOUBLE instead. This patch: If an architecture supports CMPXCHG_LOCAL this shouldn't result automatically in larger struct pages if the SLUB allocator is used. Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can be selected if a double word aligned struct page is required. Also update x86 Kconfig so that it should work as before. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fc8d8620 |
|
10-Jan-2012 |
Stanislaw Gruszka <sgruszka@redhat.com> |
slub: min order when debug_guardpage_minorder > 0 Disable slub debug facilities and allocate slabs at minimal order when debug_guardpage_minorder > 0 to increase probability to catch random memory corruption by cpu exception. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
74ee4ef1 |
|
09-Jan-2012 |
David Rientjes <rientjes@google.com> |
slub: disallow changing cpu_partial from userspace for debug caches For caches with debugging enabled, "slub: Switch per cpu partial page support off for debugging" changes cpu_partial to 0. It shouldn't be tunable from userspace for such caches, otherwise the same accounting issues arise during validation. This patch disallows tuning /sys/kernel/slab/cache/cpu_partial to be non-zero for caches with debugging enabled. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
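The rule introduced by 74ee4ef1 can be modelled as a store handler that parses the written value and rejects a non-zero setting for a debug cache. This is a userspace sketch only: `sk_cpu_partial_store`, `struct sk_cache`, and `SK_DEBUG` are hypothetical, and returning `-EINVAL` is an assumption about how such a rejection would typically be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SK_DEBUG 0x1u         /* hypothetical "debugging enabled" flag */

struct sk_cache {
    unsigned int flags;
    unsigned int cpu_partial;
};

/*
 * Parse `buf` as an unsigned decimal and store it in s->cpu_partial.
 * A non-zero value is refused for caches with debugging enabled.
 * Returns 0 on success or -EINVAL on rejection.
 */
static int sk_cpu_partial_store(struct sk_cache *s, const char *buf)
{
    char *end;
    unsigned long value;

    errno = 0;
    value = strtoul(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -EINVAL;       /* not a number */

    if (value != 0 && (s->flags & SK_DEBUG))
        return -EINVAL;       /* debug caches must keep cpu_partial at 0 */

    s->cpu_partial = (unsigned int)value;
    return 0;
}
```

Writing 0 to a debug cache still succeeds, since that matches the value the earlier commit already forces.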
#
cdcd6298 |
|
02-Jan-2012 |
Jan Beulich <JBeulich@suse.com> |
x86: Fix and improve cmpxchg_double{,_local}() Just like the per-CPU ones they had several problems/shortcomings:

Only the first memory operand was mentioned in the asm() operands, and the 2x64-bit version didn't have a memory clobber while the 2x32-bit one did. The former allowed the compiler to not recognize the need to re-load the data in case it had it cached in some register, while the latter was overly destructive.

The types of the local copies of the old and new values were incorrect (the types of the pointed-to variables should be used here, to make sure the respective old/new variable types are compatible).

The __dummy/__junk variables were pointless, given that local copies of the inputs already existed (and can hence be used for discarded outputs).

The 32-bit variant of cmpxchg_double_local() referenced cmpxchg16b_local().

At once also:
- change the return value type to what it really is: 'bool'
- unify 32- and 64-bit variants
- abstract out the common part of the 'normal' and 'local' variants

Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
933393f5 |
|
22-Dec-2011 |
Christoph Lameter <cl@linux.com> |
percpu: Remove irqsafe_cpu_xxx variants We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. -tj: This is part of on-going percpu API cleanup. For detailed discussion of the subject, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1222078 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <alpine.DEB.2.00.1112221154380.11787@router.home>
|
#
b13683d1 |
|
10-Nov-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: add missed accounting With the per-cpu partial list, a slab is added to the per-cpu partial list first and then moved to the node list. The __slab_free() code path for add/remove_partial is almost deprecated (except for slub debug). But we forgot to account add/remove_partial when moving per-cpu partial pages to the node list, so the statistics for such events are always 0. Add the corresponding accounting. This is against the patch "slub: use correct parameter to add a page to partial list tail". Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
213eeb9f |
|
11-Nov-2011 |
Christoph Lameter <cl@linux.com> |
slub: Extract get_freelist from __slab_alloc get_freelist retrieves free objects from the page freelist (put there by remote frees) or deactivates a slab page if no more objects are available. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8f1e33da |
|
23-Nov-2011 |
Christoph Lameter <cl@linux.com> |
slub: Switch per cpu partial page support off for debugging Eric saw an issue with accounting of slabs during validation. It is not possible to determine accurately how many per cpu partial slabs exist at any time, so this switches off per cpu partial pages during debug. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
73736e03 |
|
12-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: fix a possible memleak in __slab_alloc() Zhihua Che reported a possible memleak in the slub allocator on CONFIG_PREEMPT=y builds. It is possible that the current thread migrates right before irqs are disabled in __slab_alloc(). We must check c->freelist again, and perform a normal allocation instead of scratching c->freelist. Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39. V2: It is also possible that an IRQ freed one (or several) object(s) and populated c->freelist, so it is not a CONFIG_PREEMPT-only problem. Cc: <stable@vger.kernel.org> [2.6.39+] Reported-by: Zhihua Che <zhihua.che@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4c493a5a |
|
10-Nov-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: add missed accounting With the per-cpu partial list, a slab is added to the per-cpu partial list first and then moved to the node list. The __slab_free() code path for add/remove_partial is almost deprecated (except for slub debug). But we forgot to account add/remove_partial when moving per-cpu partial pages to the node list, so the statistics for such events are always 0. Add the corresponding accounting. This is against the patch "slub: use correct parameter to add a page to partial list tail". Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
bc6697d8 |
|
22-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: avoid potential NULL dereference or corruption show_slab_objects() can trigger NULL dereferences or memory corruption. Another cpu can change its c->page to NULL or c->node to NUMA_NO_NODE while we use them. Use ACCESS_ONCE(c->page) and ACCESS_ONCE(c->node) to make sure this cannot happen. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
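The single-snapshot idea behind ACCESS_ONCE() can be sketched in user-space C. This is only an illustration: the struct names and count_objects() are hypothetical stand-ins, not the kernel's definitions, and the macro mirrors the commonly cited form of ACCESS_ONCE() using the GNU __typeof__ extension.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct page_stub {
    int objects;
};

struct cpu_slab_stub {
    struct page_stub *page;   /* may be set to NULL by another CPU */
    int node;
};

/* Force exactly one load of x; __typeof__ is a GNU C extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Take one snapshot of c->page, then test and dereference only the
 * snapshot, so the compiler cannot reload a value that changed in between.
 */
static inline int count_objects(struct cpu_slab_stub *c)
{
    struct page_stub *page = ACCESS_ONCE(c->page);

    if (!page)
        return 0;
    return page->objects;
}
```

The point of the local copy is that the NULL check and the dereference are guaranteed to see the same pointer value.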
|
#
42d623a8 |
|
23-Nov-2011 |
Christoph Lameter <cl@linux.com> |
slub: use irqsafe_cpu_cmpxchg for put_cpu_partial The cmpxchg must be irq safe. The fallback for this_cpu_cmpxchg only disables preemption which results in per cpu partial page operation potentially failing on non x86 platforms. This patch fixes the following problem reported by Christian Kujau: I seem to hit it with heavy disk & cpu IO is in progress on this PowerBook G4. Full dmesg & .config: http://nerdbynature.de/bits/3.2.0-rc1/oops/ I've enabled some debug options and now it really points to slub.c:2166 http://nerdbynature.de/bits/3.2.0-rc1/oops/oops4m.jpg With debug options enabled I'm currently in the xmon debugger, not sure what to make of it yet, I'll try to get something useful out of it :) Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Christian Kujau <lists@nerdbynature.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
265d47e7 |
|
15-Nov-2011 |
Dave Jones <davej@redhat.com> |
slub: add taint flag outputting to debug paths When we get corruption reports, it's useful to see if the kernel was tainted, to rule out problems we can't do anything about. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
9ada1934 |
|
13-Nov-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: move discard_slab out of node lock Lockdep reports a potential deadlock on the slub node list_lock. discard_slab() is called with the lock held in unfreeze_partials(), which could trigger a slab allocation, which could take the lock again. discard_slab() does not actually need to hold the lock if the slab has already been removed from the partial list. Acked-by: Christoph Lameter <cl@linux.com> Reported-and-tested-by: Yong Zhang <yong.zhang0@gmail.com> Reported-and-tested-by: Julie Sullivan <kernelmail.jms@gmail.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f64ae042 |
|
10-Nov-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: use correct parameter to add a page to partial list tail unfreeze_partials() needs to add the page to the partial list tail, since such a page does not have many free objects. We now explicitly use DEACTIVATE_TO_TAIL for this, while DEACTIVATE_TO_TAIL != 1. Without the fix below, this causes a performance regression (e.g. more lock contention on node->list_lock). Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
79824820 |
|
31-Oct-2011 |
Akinobu Mita <akinobu.mita@gmail.com> |
lib/string.c: introduce memchr_inv() memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
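A minimal user-space sketch of what memchr_inv() checks, assuming the same argument order as the kernel helper (start, byte, length). The name memchr_inv_sketch is hypothetical, and this byte-at-a-time loop is deliberately simpler than the kernel implementation.

```c
#include <stddef.h>

/*
 * Return a pointer to the first byte in [start, start + bytes) that
 * differs from c, or NULL if the whole buffer is filled with c.
 */
static inline void *memchr_inv_sketch(const void *start, int c, size_t bytes)
{
    const unsigned char *p = start;
    unsigned char value = (unsigned char)c;

    for (; bytes > 0; bytes--, p++) {
        if (*p != value)
            return (void *)p;
    }
    return NULL;
}
```

A NULL result means "buffer is uniformly filled", which is the check a poison or padding verifier needs.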
|
#
dcc3be6a |
|
06-Sep-2011 |
Alex Shi <alex.shi@linux.alibaba.com> |
slub: Discard slab page when node partial > minimum partial number Discarding slab should be done when node partial > min_partial. Otherwise, node partial slab may eat up all memory. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
9f264904 |
|
31-Aug-2011 |
Alex Shi <alex.shi@linux.alibaba.com> |
slub: correct comments error for per cpu partial Correct comment errors that mistake the number of cpu partial objects for a number of pages, which may mislead readers. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ab067e99 |
|
27-Sep-2011 |
Vasiliy Kulikov <segoon@openwall.com> |
mm: restrict access to slab files under procfs and sysfs Historically /proc/slabinfo and files under /sys/kernel/slab/* have world read permissions and are accessible to the world. slabinfo contains rather private information related both to the kernel and userspace tasks. Depending on the situation, it might reveal either private information per se or information useful to make another targeted attack. Some examples of what can be learned by reading/watching for /proc/slabinfo entries: 1) dentry (and different *inode*) number might reveal other processes fs activity. The number of dentry "active objects" doesn't strictly show file count opened/touched by a process, however, there is a good correlation between them. The patch "proc: force dcache drop on unauthorized access" relies on the privacy of dentry count. 2) different inode entries might reveal the same information as (1), but these are more fine-grained counters. If a filesystem is mounted in a private mount point (or even a private namespace) and fs type differs from other mounted fs types, fs activity in this mount point/namespace is revealed. If there is a single ecryptfs mount point, the whole fs activity of a single user is revealed. Number of files in ecryptfs mount point is private information per se. 3) fuse_* reveals number of files / fs activity of a user in a user private mount point. It is approx. the same severity as ecryptfs infoleak in (2). 4) sysfs_dir_cache similar to (2) reveals devices' addition/removal, which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo the precise number of sysfs files is known to the world. 5) buffer_head might reveal some kernel activity. With other information leaks an attacker might identify what specific kernel routines generate buffer_head activity. 6) *kmalloc* infoleaks are very situational. Attacker should watch for the specific kmalloc size entry and filter the noise related to the unrelated kernel activity.
If an attacker has a relatively silent victim system, he might get rather precise counters. Additional information sources might significantly increase the slabinfo infoleak benefits. E.g. if an attacker knows that the process activity on the system is very low (only core daemons like syslog and cron), he may run setxid binaries / trigger local daemon activity / trigger network services activity / await sporadic cron jobs activity / etc. and get rather precise counters for fs and network activity of these privileged tasks, which is unknown otherwise. Also hiding slabinfo and /sys/kernel/slab/* is one step toward complicating exploitation of kernel heap overflows (and possibly, other bugs). The related discussion: http://thread.gmane.org/gmane.linux.kernel/1108378 To keep compatibility with old permission model where non-root monitoring daemon could watch for kernel memleaks through slabinfo one should do: groupadd slabinfo usermod -a -G slabinfo $MONITOR_USER And add the following commands to init scripts (to mountall.conf in Ubuntu's upstart case): chmod g+r /proc/slabinfo /sys/kernel/slab/*/* chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/* Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: Kees Cook <kees@ubuntu.com> Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> CC: Valdis.Kletnieks@vt.edu CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Alan Cox <alan@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
12d79634 |
|
06-Sep-2011 |
Alex Shi <alex.shi@linux.alibaba.com> |
slub: Code optimization in get_partial_node() I find a way to reduce a variable in get_partial_node(). That is also helpful for code understanding. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
136333d1 |
|
23-Aug-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: explicitly document position of inserting slab to partial list Whether a slab is added to the head or the tail of the partial list is performance sensitive. So explicitly use DEACTIVATE_TO_TAIL/DEACTIVATE_TO_HEAD to document the position and avoid getting it wrong. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
130655ef |
|
22-Aug-2011 |
Shaohua Li <shaohua.li@intel.com> |
slub: add slab with one free object to partial list tail When a slab has just one free object, adding it to the partial list head doesn't make sense, and it can cause lock contention. For example: 1. a CPU takes the slab from the partial list 2. fetches an object 3. switches to another slab 4. frees an object, and the slab is added to the partial list again. In this way n->list_lock will be heavily contended. In fact, Alex saw a hackbench regression: 3.1-rc1 performance drops about 70% against 3.0. This patch fixes it. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
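The head-versus-tail choice can be illustrated with a small circular list. This is a sketch under assumptions: slab_stub, the position enum and add_partial_sketch() are hypothetical names, not the kernel's add_partial() or its DEACTIVATE_* constants.

```c
#include <stddef.h>

/* Hypothetical position selector standing in for head/tail insertion. */
enum position { TO_HEAD, TO_TAIL };

/* Hypothetical slab descriptor linked into a circular list with a sentinel. */
struct slab_stub {
    struct slab_stub *prev, *next;
    int free_objects;
};

static inline void partial_init(struct slab_stub *head)
{
    head->prev = head->next = head;
}

static inline void insert_between(struct slab_stub *s,
                                  struct slab_stub *prev,
                                  struct slab_stub *next)
{
    s->prev = prev;
    s->next = next;
    prev->next = s;
    next->prev = s;
}

/* Nearly full slabs go to the tail so they are not picked up again at once. */
static inline void add_partial_sketch(struct slab_stub *head,
                                      struct slab_stub *s, enum position pos)
{
    if (pos == TO_TAIL)
        insert_between(s, head->prev, head);
    else
        insert_between(s, head, head->next);
}

/* Allocation takes the first slab on the list, or NULL if it is empty. */
static inline struct slab_stub *first_partial(struct slab_stub *head)
{
    return head->next == head ? NULL : head->next;
}
```

With tail insertion, a slab holding a single free object is the last candidate an allocating CPU sees, so it is less likely to bounce on and off the list.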
|
#
49e22585 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: per cpu cache for partial pages Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to partial pages. The partial page list is used in slab_free() to avoid per node lock taking. In __slab_alloc() we can then take multiple partial pages off the per node partial list in one go reducing node lock pressure. We can also use the per cpu partial list in slab_alloc() to avoid scanning partial lists for pages with free objects. The main effect of a per cpu partial list is that the per node list_lock is taken for batches of partial pages instead of individual ones. Potential future enhancements: 1. The pickup from the partial list could perhaps be done without disabling interrupts with some work. The free path already puts the page into the per cpu partial list without disabling interrupts. 2. __slab_free() may have some code paths that could use optimization. Performance: Before After ./hackbench 100 process 200000 Time: 1953.047 1564.614 ./hackbench 100 process 20000 Time: 207.176 156.940 ./hackbench 100 process 20000 Time: 204.468 156.940 ./hackbench 100 process 20000 Time: 204.879 158.772 ./hackbench 10 process 20000 Time: 20.153 15.853 ./hackbench 10 process 20000 Time: 20.153 15.986 ./hackbench 10 process 20000 Time: 19.363 16.111 ./hackbench 1 process 20000 Time: 2.518 2.307 ./hackbench 1 process 20000 Time: 2.258 2.339 ./hackbench 1 process 20000 Time: 2.864 2.163 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
497b66f2 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: return object pointer from get_partial() / new_slab(). There is no need anymore to return the pointer to a slab page from get_partial() since the page reference can be stored in the kmem_cache_cpu structure's "page" field. Return an object pointer instead. That in turn allows a simplification of the spaghetti code in __slab_alloc(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
acd19fd1 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: pass kmem_cache_cpu pointer to get_partial() Pass the kmem_cache_cpu pointer to get_partial(). That way we can avoid the this_cpu_write() statements. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
e6e82ea1 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: Prepare inuse field in new_slab() inuse will always be set to page->objects. There is no point in initializing the field to zero in new_slab() and then overwriting the value in __slab_alloc(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
7db0d705 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: Remove useless statements in __slab_alloc Two statements in __slab_alloc() do not have any effect. 1. c->page is already set to NULL by deactivate_slab() called right before. 2. gfpflags are masked in new_slab() before being passed to the page allocator. There is no need to mask gfpflags in __slab_alloc in particular since most frequent processing in __slab_alloc does not require the use of a gfpmask. Cc: torvalds@linux-foundation.org Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
69cb8e6b |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: free slabs without holding locks There are two situations in which slub holds a lock while releasing pages: A. During kmem_cache_shrink() B. During kmem_cache_close() For A build a list while holding the lock and then release the pages later. In case of B we are the last remaining user of the slab so there is no need to take the listlock. After this patch all calls to the page allocator to free pages are done without holding any spinlocks. kmem_cache_destroy() will still hold the slub_lock semaphore. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
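The pattern for case A (collect under the lock, release afterwards) can be sketched in user-space C. Everything here is an assumption for illustration: node_stub, slab_stub, push_slab() and shrink_sketch() are hypothetical, a C11 atomic_flag spinlock stands in for list_lock, and free() stands in for returning pages to the page allocator.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical slab descriptor on a singly linked partial list. */
struct slab_stub {
    struct slab_stub *next;
    int inuse;                 /* objects still allocated from this slab */
};

/* Hypothetical per-node state; the atomic_flag stands in for list_lock. */
struct node_stub {
    atomic_flag list_lock;
    struct slab_stub *partial;
};

/* Helper for building a list: allocate a slab and push it on the list. */
static inline struct slab_stub *push_slab(struct node_stub *n, int inuse)
{
    struct slab_stub *s = malloc(sizeof(*s));

    if (!s)
        return NULL;
    s->inuse = inuse;
    s->next = n->partial;
    n->partial = s;
    return s;
}

/* Unlink empty slabs under the lock, free them only after dropping it. */
static inline int shrink_sketch(struct node_stub *n)
{
    struct slab_stub *discard = NULL, **pp, *s;
    int freed = 0;

    while (atomic_flag_test_and_set_explicit(&n->list_lock,
                                             memory_order_acquire))
        ;                      /* spin until the lock is ours */

    pp = &n->partial;
    while ((s = *pp) != NULL) {
        if (s->inuse == 0) {
            *pp = s->next;     /* unlink from the partial list */
            s->next = discard; /* remember it for later release */
            discard = s;
        } else {
            pp = &s->next;
        }
    }
    atomic_flag_clear_explicit(&n->list_lock, memory_order_release);

    /* No lock is held while memory is handed back. */
    while (discard) {
        s = discard;
        discard = s->next;
        free(s);
        freed++;
    }
    return freed;
}
```

Keeping the release step outside the critical section is what avoids calling into the allocator while a spinlock is held.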
|
#
81107188 |
|
09-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: Fix partial count comparison confusion deactivate_slab() gets the comparison wrong that checks whether more than the minimum number of partial pages are in the partial list. An effect of this may be that empty pages are not freed from deactivate_slab(). The result could be an OOM due to growth of the partial slabs per node. Frees mostly occur from __slab_free which is okay so this would only affect use cases where a lot of switching around of per cpu slabs occurs. Switching per cpu slabs occurs with high frequency if debugging options are enabled. Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ef62fb32 |
|
07-Aug-2011 |
Akinobu Mita <akinobu.mita@gmail.com> |
slub: fix check_bytes() for slub debugging The check_bytes() function is used by slub debugging. It returns a pointer to the first unmatching byte for a character in the given memory area. If the character for matching byte is 0x80 or greater, check_bytes() doesn't work, because the 64-bit pattern is generated as below: value64 = value | value << 8 | value << 16 | value << 24; value64 = value64 | value64 << 32; The integer promotions are performed and the result is sign-extended, as the type of value is u8. The upper 32 bits of value64 are 0xffffffff after the first line, and the second line has no effect. This fixes the 64-bit pattern generation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
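The sign-extension effect described above can be reproduced in a small user-space sketch. Both function names are hypothetical. pattern64_buggy() emulates the result of the two quoted lines through explicit casts (so the sketch does not depend on shifting into the sign bit), and pattern64_fixed() is just one way to build the pattern in unsigned 64-bit arithmetic, not necessarily the change made by the patch.

```c
#include <stdint.h>

/*
 * Emulates the buggy pattern: the 32-bit pattern is treated as a signed
 * int, so a byte with the top bit set sign-extends when widened to 64 bits.
 */
static inline uint64_t pattern64_buggy(uint8_t value)
{
    uint32_t v32 = (uint32_t)value * 0x01010101u;       /* byte repeated 4x */
    uint64_t value64 = (uint64_t)(int64_t)(int32_t)v32; /* sign extension */

    /* Upper 32 bits are already all ones for value >= 0x80: no effect. */
    return value64 | value64 << 32;
}

/* One possible remedy: do all arithmetic in unsigned 64-bit. */
static inline uint64_t pattern64_fixed(uint8_t value)
{
    return (uint64_t)value * 0x0101010101010101ULL;     /* byte repeated 8x */
}
```

For a byte such as 0x6b both versions agree; for 0xa5 the emulated buggy version yields 0xffffffffa5a5a5a5 instead of 0xa5a5a5a5a5a5a5a5.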
|
#
6fbabb20 |
|
08-Aug-2011 |
Christoph Lameter <cl@linux.com> |
slub: Fix full list corruption if debugging is on When a slab is freed by __slab_free() and the slab can only contain a single object ever then it was full (and therefore not on the partial lists but on the full list in the debug case) before we reached slab_empty. This caused the following full list corruption when SLUB debugging was enabled: [ 5913.233035] ------------[ cut here ]------------ [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98() [ 5913.233101] Hardware name: Adamo 13 [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520 [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan] [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127 [ 5913.233213] Call Trace: [ 5913.233213] <IRQ> [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b [ 5913.233213] [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48 [ 5913.233213] [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98 [ 5913.233213] [<ffffffff8127e7da>] list_del+0xe/0x2d [ 5913.233213] [<ffffffff814e0430>] __slab_free+0x1db/0x235 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? 
bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff81133085>] kmem_cache_free+0x88/0x102 [ 5913.233213] [<ffffffff811706ab>] bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706e1>] bio_free+0x34/0x64 [ 5913.233213] [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14 [ 5913.233213] [<ffffffff8116fef6>] bio_put+0x2b/0x2d [ 5913.233213] [<ffffffff813dccab>] clone_endio+0x9e/0xb4 [ 5913.233213] [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f [ 5913.233213] [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt] [ 5913.233213] [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt] [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ] Make sure that we remove such a slab also from the full lists. Reported-and-tested-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ffc79d28 |
|
29-Jul-2011 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
slub: use print_hex_dump Less code and same functionality. The output would be: | Object c7428000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk | Object c7428010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk | Object c7428020: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk | Object c7428030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkk. | Redzone c742803c: bb bb bb bb .... | Padding c7428064: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ | Padding c7428074: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZ Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
9e577e8b |
|
22-Jul-2011 |
Christoph Lameter <cl@linux.com> |
slub: When allocating a new slab also prep the first object We need to branch to the debug code for the first object if we allocate a new slab otherwise the first object will be marked wrongly as inactive. Tested-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
497888cf |
|
14-Jul-2011 |
Phil Carmody <ext-phil.2.carmody@nokia.com> |
treewide: fix potentially dangerous trailing ';' in #defined values/expressions All these are instances of #define NAME value; or #define NAME(params_opt) value; These of course fail to build when used in contexts like if(foo $OP NAME) while(bar $OP NAME) and may silently generate the wrong code in contexts such as foo = NAME + 1; /* foo = value; + 1; */ bar = NAME - 1; /* bar = value; - 1; */ baz = NAME & quux; /* baz = value; & quux; */ Reported on comp.lang.c, Message-ID: <ab0d55fe-25e5-482b-811e-c475aa6065c3@c29g2000yqd.googlegroups.com> Initial analysis of the dangers provided by Keith Thompson in that thread. There are many more instances of more complicated macros having unnecessary trailing semicolons, but this pile seems to be all of the cases of simple values suffering from the problem. (Thus things that are likely to be found in one of the contexts above, more complicated ones aren't.) Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
1d07171c |
|
13-Jul-2011 |
Christoph Lameter <cl@linux.com> |
slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock Split cmpxchg_double_slab into two functions. One for the case where we know that interrupts are disabled (and therefore the fallback does not need to disable interrupts) and one for the other cases where fallback will also disable interrupts. This fixes the issue that __slab_free called cmpxchg_double_slab in some scenarios without disabling interrupts. Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
bfa71457 |
|
07-Jul-2011 |
Pekka Enberg <penberg@kernel.org> |
SLUB: Fix missing <linux/stacktrace.h> include This fixes the following build breakage caused by commit d6543e3 ("slub: Enable backtrace for create/delete points"): CC mm/slub.o mm/slub.c: In function ‘set_track’: mm/slub.c:428: error: storage size of ‘trace’ isn’t known mm/slub.c:435: error: implicit declaration of function ‘save_stack_trace’ mm/slub.c:428: warning: unused variable ‘trace’ make[1]: *** [mm/slub.o] Error 1 make: *** [mm/slub.o] Error 2 Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c4089f98 |
|
26-Jun-2011 |
Marcin Slusarz <marcin.slusarz@gmail.com> |
slub: reduce overhead of slub_debug slub checks for poison one byte at a time, which is highly inefficient and frequently shows up as the highest cpu-eater in perf top. Joining reads gives a nice speedup: (Compiling some project with different options) make -j12 make clean slub_debug disabled: 1m 27s 1.2 s slub_debug enabled: 1m 46s 7.6 s slub_debug enabled + this patch: 1m 33s 3.2 s check_bytes still shows up high, but not always at the top. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: linux-mm@kvack.org Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d18a90dd |
|
07-Jul-2011 |
Ben Greear <greearb@candelatech.com> |
slub: Add method to verify memory is not freed This is for tracking down suspect memory usage. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d6543e39 |
|
07-Jul-2011 |
Ben Greear <greearb@candelatech.com> |
slub: Enable backtrace for create/delete points This patch attempts to grab a backtrace for the creation and deletion points of the slub object. When a fault is detected, we can then get a better idea of where the item was deleted. Example output from debugging some funky nfs/rpc behaviour: ============================================================================= BUG kmalloc-64: Object is on free-list ----------------------------------------------------------------------------- INFO: Allocated in rpcb_getport_async+0x39c/0x5a5 [sunrpc] age=381 cpu=3 pid=3750 __slab_alloc+0x348/0x3ba kmem_cache_alloc_trace+0x67/0xe7 rpcb_getport_async+0x39c/0x5a5 [sunrpc] call_bind+0x70/0x75 [sunrpc] __rpc_execute+0x78/0x24b [sunrpc] rpc_execute+0x3d/0x42 [sunrpc] rpc_run_task+0x79/0x81 [sunrpc] rpc_call_sync+0x3f/0x60 [sunrpc] rpc_ping+0x42/0x58 [sunrpc] rpc_create+0x4aa/0x527 [sunrpc] nfs_create_rpc_client+0xb1/0xf6 [nfs] nfs_init_client+0x3b/0x7d [nfs] nfs_get_client+0x453/0x5ab [nfs] nfs_create_server+0x10b/0x437 [nfs] nfs_fs_mount+0x4ca/0x708 [nfs] mount_fs+0x6b/0x152 INFO: Freed in rpcb_map_release+0x3f/0x44 [sunrpc] age=30 cpu=2 pid=29049 __slab_free+0x57/0x150 kfree+0x107/0x13a rpcb_map_release+0x3f/0x44 [sunrpc] rpc_release_calldata+0x12/0x14 [sunrpc] rpc_free_task+0x59/0x61 [sunrpc] rpc_final_put_task+0x82/0x8a [sunrpc] __rpc_execute+0x23c/0x24b [sunrpc] rpc_async_schedule+0x10/0x12 [sunrpc] process_one_work+0x230/0x41d worker_thread+0x133/0x217 kthread+0x7d/0x85 kernel_thread_helper+0x4/0x10 INFO: Slab 0xffffea00029aa470 objects=20 used=9 fp=0xffff8800be7830d8 flags=0x20000000004081 INFO: Object 0xffff8800be7830d8 @offset=4312 fp=0xffff8800be7827a8 Bytes b4 0xffff8800be7830c8: 87 a8 96 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ Object 0xffff8800be7830d8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xffff8800be7830e8: 6b 6b 6b 6b 01 08 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkk..kkkkkkkkkk Object 0xffff8800be7830f8: 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xffff8800be783108: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. Redzone 0xffff8800be783118: bb bb bb bb bb bb bb bb ........ Padding 0xffff8800be783258: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 29049, comm: kworker/2:2 Not tainted 3.0.0-rc4+ #8 Call Trace: [<ffffffff811055c3>] print_trailer+0x131/0x13a [<ffffffff81105601>] object_err+0x35/0x3e [<ffffffff8110746f>] verify_mem_not_deleted+0x7a/0xb7 [<ffffffffa02851b5>] rpcb_getport_done+0x23/0x126 [sunrpc] [<ffffffffa027d0ba>] rpc_exit_task+0x3f/0x6d [sunrpc] [<ffffffffa027d4ab>] __rpc_execute+0x78/0x24b [sunrpc] [<ffffffffa027d6c0>] ? rpc_execute+0x42/0x42 [sunrpc] [<ffffffffa027d6d0>] rpc_async_schedule+0x10/0x12 [sunrpc] [<ffffffff810611b7>] process_one_work+0x230/0x41d [<ffffffff81061102>] ? process_one_work+0x17b/0x41d [<ffffffff81063613>] worker_thread+0x133/0x217 [<ffffffff810634e0>] ? manage_workers+0x191/0x191 [<ffffffff81066e10>] kthread+0x7d/0x85 [<ffffffff81485924>] kernel_thread_helper+0x4/0x10 [<ffffffff8147eb18>] ? retint_restore_args+0x13/0x13 [<ffffffff81066d93>] ? __init_kthread_worker+0x56/0x56 [<ffffffff81485920>] ? gs_change+0x13/0x13 Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4eade540 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Not necessary to check for empty slab on load_freelist load_freelist is now branched to only if there are objects available, so there is no need to check the object variable for NULL. Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
03e404af |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: fast release on full slab Make deactivation occur implicitly while checking out the current freelist. This avoids one cmpxchg operation on a slab that is now fully in use. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
e36a2652 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Add statistics for the case that the current slab does not match the node Slub reloads the per cpu slab if the page does not satisfy the NUMA condition. Track those reloads since doing so has a performance impact. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
fc59c053 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Get rid of the another_slab label We can avoid deactivate slab in special cases if we do the deactivation of slabs in each code flow that leads to new_slab. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
80f08c19 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Avoid disabling interrupts in free slowpath Disabling interrupts can be avoided now. However, list operations still require disabling interrupts since allocations can occur from interrupt contexts and there is no way to perform atomic list operations. The acquisition of the list_lock therefore has to disable interrupts as well. Dropping interrupt handling significantly simplifies the slowpath. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
5c2e4bbb |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Disable interrupts in free_debug processing We will be calling free_debug_processing with interrupts disabled in some cases when the later patches are applied. Some of the functions called by free_debug_processing expect interrupts to be off. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
881db7fb |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Invert locking and avoid slab lock Locking slabs is no longer necessary if the arch supports cmpxchg operations and if no debugging features are used on a slab. If the arch does not support cmpxchg then we fall back to using the slab lock to do a cmpxchg-like operation. The patch also changes the lock order. Slab locks are subsumed to the node lock now. With that approach slab_trylocking is no longer necessary. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
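The fallback described in 881db7fb, where a lock stands in for a missing double-word cmpxchg, can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: `struct slab_sketch` and `cmpxchg_double_fallback()` are hypothetical names, and a C11 `atomic_flag` spinlock plays the role of the slab lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the slab fields that must change together. */
struct slab_sketch {
    atomic_flag lock;        /* plays the role of the per-slab lock */
    void *freelist;          /* first free object, or NULL */
    unsigned long counters;  /* packed object counts and state bits */
};

static void slab_lock(struct slab_sketch *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;   /* spin until the current holder clears the flag */
}

static void slab_unlock(struct slab_sketch *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/*
 * Compare-and-exchange of both words, emulated under the lock.
 * Installs the new pair and returns true only if both old values match.
 */
static bool cmpxchg_double_fallback(struct slab_sketch *s,
                                    void *old_free, unsigned long old_cnt,
                                    void *new_free, unsigned long new_cnt)
{
    bool ok = false;

    assert(s != NULL);
    slab_lock(s);
    if (s->freelist == old_free && s->counters == old_cnt) {
        s->freelist = new_free;
        s->counters = new_cnt;
        ok = true;
    }
    slab_unlock(s);
    return ok;
}
```

Callers see the same succeed-or-retry contract either way, so the surrounding allocation and free paths do not need to know whether the hardware instruction or the lock was used.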
#
2cfb7455 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Rework allocator fastpaths Rework the allocation paths so that updates of the page freelist, frozen state and number of objects use cmpxchg_double_slab(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
61728d1e |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Pass kmem_cache struct to lock and freeze slab We need more information about the slab for the cmpxchg implementation. Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
5cc6eee8 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: explicit list_lock taking The allocator fastpath rework does change the usage of the list_lock. Remove the list_lock processing from the functions that hide them from the critical sections and move them into those critical sections. This in turn simplifies the support functions (no __ variant needed anymore) and simplifies the lock handling on bootstrap. Inline add_partial since it becomes pretty simple. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
b789ef51 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Add cmpxchg_double_slab() Add a function that operates on the second doubleword in the page struct and manipulates the object counters, the freelist and the frozen attribute. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8cb0a506 |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Move page->frozen handling near where the page->freelist handling occurs This is necessary because the frozen bit has to be handled in the same cmpxchg_double with the freelist and the counters. Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
50d5c41c |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Do not use frozen page flag but a bit in the page counters Do not use a page flag for the frozen bit. It needs to be part of the state that is handled with cmpxchg_double(). So use a bit in the counter struct in the page struct for that purpose. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
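Commit 50d5c41c moves the frozen state into the counters word so that it is covered by the same compare-and-exchange as the object counts. A minimal sketch of such a packing follows. The bit widths (16 bits in use, 15 bits objects, 1 frozen bit) and all helper names are assumptions made for this example; explicit shifts are used instead of C bit-fields so the layout does not depend on the compiler.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout: 16 bits inuse, 15 bits objects, 1 frozen bit. */
#define INUSE_BITS    16
#define OBJECTS_BITS  15
#define INUSE_MASK    ((1UL << INUSE_BITS) - 1)
#define OBJECTS_MASK  ((1UL << OBJECTS_BITS) - 1)
#define FROZEN_SHIFT  (INUSE_BITS + OBJECTS_BITS)

/* Pack all three fields into one word that can be compared as a unit. */
static unsigned long pack_counters(unsigned inuse, unsigned objects, bool frozen)
{
    assert(inuse <= INUSE_MASK && objects <= OBJECTS_MASK);
    return (unsigned long)inuse |
           ((unsigned long)objects << INUSE_BITS) |
           ((unsigned long)frozen << FROZEN_SHIFT);
}

static unsigned counters_inuse(unsigned long c)
{
    return (unsigned)(c & INUSE_MASK);
}

static unsigned counters_objects(unsigned long c)
{
    return (unsigned)((c >> INUSE_BITS) & OBJECTS_MASK);
}

static bool counters_frozen(unsigned long c)
{
    return ((c >> FROZEN_SHIFT) & 1UL) != 0;
}
```

Because a change to the frozen bit alters the packed word, a compare against a stale word fails even when the object counts happen to be unchanged.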
#
7e0528da |
|
31-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Push irq disable into allocate_slab() Do the irq handling in allocate_slab() instead of __slab_alloc(). __slab_alloc() is already cluttered and allocate_slab() is already fiddling around with gfp flags. v6->v7: Only increment ORDER_FALLBACK if we get a page during fallback Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d4d84fef |
|
02-Jun-2011 |
Chris Metcalf <cmetcalf@tilera.com> |
slub: always align cpu_slab to honor cmpxchg_double requirement On an architecture without CMPXCHG_LOCAL but with DEBUG_VM enabled, the VM_BUG_ON() in __pcpu_double_call_return_bool() will cause an early panic during boot unless we always align cpu_slab properly. In principle we could remove the alignment-testing VM_BUG_ON() for architectures that don't have CMPXCHG_LOCAL, but leaving it in means that new code will tend not to break x86 even if it is introduced on another platform, and it's low cost to require alignment. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
49a78d08 |
|
25-May-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
slub: remove no-longer used 'unlock_out' label Commit a71ae47a2cbf ("slub: Fix double bit unlock in debug mode") removed the only goto to this label, resulting in mm/slub.c: In function '__slab_alloc': mm/slub.c:1834: warning: label 'unlock_out' defined but not used fixed trivially by the removal of the label itself too. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a71ae47a |
|
25-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Fix double bit unlock in debug mode Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a call to deactivate_slab() in the debug case in __slab_alloc(), which unlocks the current slab used for allocation. Going to the label 'unlock_out' then does it again. Also, in the debug case we do not need all the other processing that the 'unlock_out' path does. We always fall back to the slow path in the debug case. So the tid update is useless. Similarly, ALLOC_SLOWPATH would just be incremented for all allocations. Also a pretty useless thing. So simply restore irq flags and return the object. Signed-off-by: Christoph Lameter <cl@linux.com> Reported-and-bisected-by: James Morris <jmorris@namei.org> Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Jens Axboe <jaxboe@fusionio.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
442b06bc |
|
17-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Remove node check in slab_free We can set the page pointing in the percpu structure to NULL to have the same effect as setting c->node to NUMA_NO_NODE. Gets rid of one check in slab_free() that was only used for forcing the slab_free to the slowpath for debugging. We still need to set c->node to NUMA_NO_NODE to force the slab_alloc() fastpath to the slowpath in case of debugging. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
bd07d87f |
|
12-May-2011 |
David Rientjes <rientjes@google.com> |
slub: avoid label inside conditional Jumping to a label inside a conditional is considered poor style, especially considering the current organization of __slab_alloc(). This removes the 'load_from_page' label and just duplicates the three lines of code that it uses: c->node = page_to_nid(page); c->page = page; goto load_freelist; since it's probably not worth making this a separate helper function. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
1393d9a1 |
|
16-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Use probe_kernel_read in that case in order not to cause a page fault. Cc: <stable@kernel.org> # 38.x Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
6332aa9d |
|
16-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Avoid warning for !CONFIG_SLUB_DEBUG Move the #ifdef so that get_map is only defined if CONFIG_SLUB_DEBUG is defined. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
1759415e |
|
05-May-2011 |
Christoph Lameter <cl@linux.com> |
slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used everywhere. There may be performance implications since: A. We now have to manage a transaction ID for all arches B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced to a very short irqoff section. There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback case we only have to do one disable and enable like before. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
30106b8c |
|
04-May-2011 |
Thomas Gleixner <tglx@linutronix.de> |
slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg The SLUB allocator use of the cmpxchg_double logic was wrong: it actually needs the irq-safe one. That happens automatically when we use the native unlocked 'cmpxchg8b' instruction, but when compiling the kernel for older x86 CPUs that do not support that instruction, we fall back to the generic emulation code. And if you don't specify that you want the irq-safe version, the generic code ends up just open-coding the cmpxchg8b equivalent without any protection against interrupts or preemption. Which definitely doesn't work for SLUB. This was reported by Werner Landgraf <w.landgraf@ru.ru>, who saw instability with his distro-kernel that was compiled to support pretty much everything under the sun. Most big Linux distributions tend to compile for PPro and later, and would never have noticed this problem. This also fixes the prototypes for the irqsafe cmpxchg_double functions to use 'bool' like they should. [ Btw, that whole "generic code defaults to no protection" design just sounds stupid - if the code needs no protection, there is no reason to use "cmpxchg_double" to begin with. So we should probably just remove the unprotected version entirely as pointless. - Linus ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: werner <w.landgraf@ru.ru> Acked-and-tested-by: Ingo Molnar <mingo@elte.hu> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8dc16c6c |
|
15-Apr-2011 |
Christoph Lameter <cl@linux.com> |
slub: Move debug handling in __slab_free It's easier to read if it's with the check for debugging flags. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
dc1fb7f4 |
|
15-Apr-2011 |
Christoph Lameter <cl@linux.com> |
slub: Move node determination out of hotpath If the node does not change then there is no need to recalculate the node from the page struct. So move the node determination into the places where we acquire a new slab page. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
01ad8a7b |
|
15-Apr-2011 |
Christoph Lameter <cl@linux.com> |
slub: Eliminate repeated use of c->page through a new page variable __slab_alloc is full of "c->page" repeats. Let's just use one local variable named "page" for this. Also avoids the need to have another variable called "new". Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
5f80b13a |
|
15-Apr-2011 |
Christoph Lameter <cl@linux.com> |
slub: get_map() function to establish map of free objects in a slab The bit map of free objects in a slab page is determined in various functions if debugging is enabled. Provide a common function for that purpose. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
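The idea behind get_map() in 5f80b13a, building a bitmap of free objects by walking the freelist, can be sketched in user space as follows. The signature and helpers here are hypothetical and differ from the kernel's. The sketch assumes the free pointer is stored at offset 0 of each free object and that the caller passes a zeroed bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Read the "next free object" pointer stored at the start of a free object. */
static void *get_freepointer(void *object)
{
    void *next;

    memcpy(&next, object, sizeof(next));
    return next;
}

/* Store the "next free object" pointer at the start of a free object. */
static void set_freepointer(void *object, void *next)
{
    memcpy(object, &next, sizeof(next));
}

/* Set one bit per free object; `map` must be zeroed by the caller. */
static void get_map(char *slab_base, size_t obj_size, void *freelist,
                    unsigned long *map)
{
    assert(obj_size >= sizeof(void *));
    for (void *p = freelist; p != NULL; p = get_freepointer(p)) {
        size_t idx = (size_t)((char *)p - slab_base) / obj_size;

        map[idx / BITS_PER_LONG] |= 1UL << (idx % BITS_PER_LONG);
    }
}
```

With the bitmap in hand, a debugging pass can iterate over all objects in the slab and treat every clear bit as an allocated object.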
#
33de04ec |
|
15-Apr-2011 |
Christoph Lameter <cl@linux.com> |
slub: Use NUMA_NO_NODE in get_partial A -1 was leftover during the conversion. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
607bf324 |
|
12-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
slub: Fix a typo in config name There's no config named SLAB_DEBUG, and it should be a typo of SLUB_DEBUG. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
25985edc |
|
30-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
#
b8c4c96e |
|
24-Mar-2011 |
Christoph Lameter <cl@linux.com> |
SLUB: Write to per cpu data when allocating it It turns out that the cmpxchg16b emulation has to access vmalloced percpu memory with interrupts disabled. If the memory has never been touched before then the fault necessary to establish the mapping will not occur and the kernel will fail on boot. Fix that by reusing the CONFIG_PREEMPT code that writes the cpu number into a field on every cpu. Writing to the per cpu area beforehand causes the mapping to be established before we get to a cmpxchg16b emulation. Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f9b615de |
|
24-Mar-2011 |
Thomas Gleixner <tglx@linutronix.de> |
slub: Fix debugobjects with lockless fastpath On Thu, 24 Mar 2011, Ingo Molnar wrote: > RIP: 0010:[<ffffffff810570a9>] [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260 That's a typical timer crash, but you were unable to debug it with debugobjects because commit d3f661d6 broke those. Cc: Christoph Lameter <cl@linux.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4fdccdfb |
|
22-Mar-2011 |
Christoph Lameter <cl@linux.com> |
slub: Add statistics for this_cmpxchg_double failures Add some statistics for debugging. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
2fd66c51 |
|
22-Mar-2011 |
Christoph Lameter <cl@linux.com> |
slub: Add missing irq restore for the OOM path OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a24c5a0e |
|
14-Mar-2011 |
Christoph Lameter <cl@linux.com> |
slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case The redo label needs #ifdeffery. Fixes the following problem introduced by commit 8a5ec0ba42c4 ("Lockless (and preemptless) fastpaths for slub"): mm/slub.c: In function 'slab_free': mm/slub.c:2124: warning: label 'redo' defined but not used Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
da9a638c |
|
10-Mar-2011 |
Lai Jiangshan <laijs@cn.fujitsu.com> |
slub,rcu: don't assume the size of struct rcu_head The size of struct rcu_head may be changed. When it becomes larger, it will pollute the page array. We reserve some bytes for struct rcu_head when a slab is allocated in this situation. Changed from V1: use VM_BUG_ON instead of BUG_ON Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ab9a0f19 |
|
10-Mar-2011 |
Lai Jiangshan <laijs@cn.fujitsu.com> |
slub: automatically reserve bytes at the end of slab There is no "struct" for slub's slab, it shares with struct page. But struct page is very small, it is insufficient when we need to add some metadata for slab. So we add a field "reserved" to struct kmem_cache, when a slab is allocated, kmem_cache->reserved bytes are automatically reserved at the end of the slab for slab's metadata. Changed from v1: Export the reserved field via sysfs Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
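The effect of the `reserved` field from ab9a0f19 on slab capacity can be shown with a short calculation. This is a sketch, not the kernel code: the 4096-byte page size and the helper names `order_objects()` and `slab_reserved_area()` are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size for the example */

/*
 * Number of objects of `size` bytes that fit into a slab of 2^order pages
 * once `reserved` bytes are set aside at the end for slab metadata.
 */
static unsigned long order_objects(unsigned order, unsigned long size,
                                   unsigned long reserved)
{
    unsigned long slab_bytes = SKETCH_PAGE_SIZE << order;

    assert(size > 0 && reserved <= slab_bytes);
    return (slab_bytes - reserved) / size;
}

/* Address at which the reserved metadata area begins. */
static char *slab_reserved_area(char *slab_base, unsigned order,
                                unsigned long reserved)
{
    return slab_base + (SKETCH_PAGE_SIZE << order) - reserved;
}
```

For 256-byte objects in a single 4096-byte page, reserving 16 bytes lowers the capacity from 16 objects to 15, because the last object would otherwise overlap the reserved area.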
#
8a5ec0ba |
|
25-Feb-2011 |
Christoph Lameter <cl@linux.com> |
Lockless (and preemptless) fastpaths for slub Use the this_cpu_cmpxchg_double functionality to implement a lockless allocation algorithm on arches that support fast this_cpu_ops. Each of the per cpu pointers is paired with a transaction id that ensures that updates of the per cpu information can only occur in sequence on a certain cpu. A transaction id is a "long" integer that is comprised of an event number and the cpu number. The event number is incremented for every change to the per cpu state. This means that the cmpxchg instruction can verify for an update that nothing interfered and that we are updating the percpu structure for the processor where we picked up the information and that we are also currently on that processor when we update the information. This results in a significant decrease of the overhead in the fastpaths. It also makes it easy to adopt the fast path for realtime kernels since this is lockless and does not require the use of the current per cpu area over the critical section. It is only important that the per cpu area is current at the beginning of the critical section and at the end. So there is no need even to disable preemption. Test results show that the fastpath cycle count is reduced by up to ~ 40% (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree adds a few cycles. Sadly this does nothing for the slowpath which is where the main issues with performance in slub are but the best case performance rises significantly. 
(For that see the more complex slub patches that require cmpxchg_double) Kmalloc: alloc/free test Before: 10000 times kmalloc(8)/kfree -> 134 cycles 10000 times kmalloc(16)/kfree -> 152 cycles 10000 times kmalloc(32)/kfree -> 144 cycles 10000 times kmalloc(64)/kfree -> 142 cycles 10000 times kmalloc(128)/kfree -> 142 cycles 10000 times kmalloc(256)/kfree -> 132 cycles 10000 times kmalloc(512)/kfree -> 132 cycles 10000 times kmalloc(1024)/kfree -> 135 cycles 10000 times kmalloc(2048)/kfree -> 135 cycles 10000 times kmalloc(4096)/kfree -> 135 cycles 10000 times kmalloc(8192)/kfree -> 144 cycles 10000 times kmalloc(16384)/kfree -> 754 cycles After: 10000 times kmalloc(8)/kfree -> 78 cycles 10000 times kmalloc(16)/kfree -> 78 cycles 10000 times kmalloc(32)/kfree -> 82 cycles 10000 times kmalloc(64)/kfree -> 88 cycles 10000 times kmalloc(128)/kfree -> 79 cycles 10000 times kmalloc(256)/kfree -> 79 cycles 10000 times kmalloc(512)/kfree -> 85 cycles 10000 times kmalloc(1024)/kfree -> 82 cycles 10000 times kmalloc(2048)/kfree -> 82 cycles 10000 times kmalloc(4096)/kfree -> 85 cycles 10000 times kmalloc(8192)/kfree -> 82 cycles 10000 times kmalloc(16384)/kfree -> 706 cycles Kmalloc: Repeatedly allocate then free test Before: 10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles 10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles 10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles 10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles 10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles 10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles 10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles 10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles 10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles 10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles 10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles 10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles After: 10000 times kmalloc(8) -> 190 cycles 
kfree -> 129 cycles 10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles 10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles 10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles 10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles 10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles 10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles 10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles 10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles 10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles 10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles 10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
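The transaction id scheme from 8a5ec0ba, an event number combined with the cpu number, can be sketched as follows. This is a single-threaded user-space illustration: `SKETCH_NR_CPUS`, `struct cpu_state_sketch` and `commit_alloc()` are hypothetical, and `commit_alloc()` only mimics the succeed-or-retry behaviour of a double-word compare-and-exchange without being atomic.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4UL             /* assumed cpu count, a power of two */
#define TID_STEP       SKETCH_NR_CPUS  /* the event number advances in these steps */

/* The initial tid of a cpu is just its cpu number (event number 0). */
static unsigned long init_tid(unsigned cpu)
{
    assert(cpu < TID_STEP);
    return cpu;
}

/* Advance the event number while keeping the cpu number intact. */
static unsigned long next_tid(unsigned long tid)
{
    return tid + TID_STEP;
}

static unsigned tid_to_cpu(unsigned long tid)
{
    return (unsigned)(tid % TID_STEP);
}

static unsigned long tid_to_event(unsigned long tid)
{
    return tid / TID_STEP;
}

/* Hypothetical per-cpu state: freelist head paired with its tid. */
struct cpu_state_sketch {
    void *freelist;
    unsigned long tid;
};

/*
 * Commit an allocation only if neither the freelist nor the tid changed
 * since they were sampled. Returns 1 on success, 0 if the caller must retry.
 */
static int commit_alloc(struct cpu_state_sketch *c, void *seen_free,
                        unsigned long seen_tid, void *next_free)
{
    if (c->freelist != seen_free || c->tid != seen_tid)
        return 0;
    c->freelist = next_free;
    c->tid = next_tid(seen_tid);
    return 1;
}
```

Because the cpu number is part of the tid, a task that sampled the state on one cpu and was migrated before the commit fails the comparison against the other cpu's tid, which is why preemption need not be disabled across the sequence.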
#
d3f661d6 |
|
25-Feb-2011 |
Christoph Lameter <cl@linux.com> |
slub: Get rid of slab_free_hook_irq() The following patch will make the fastpaths lockless and will no longer require interrupts to be disabled. Calling the free hook with irq disabled will no longer be possible. Move the slab_free_hook_irq() logic into slab_free_hook. Only disable interrupts if the features are selected that require callbacks with interrupts off and reenable after calls have been made. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
d71f606f |
|
26-Feb-2011 |
Mariusz Kozlowski <mk@lab.zgora.pl> |
slub: fix ksize() build error mm/slub.c: In function 'ksize': mm/slub.c:2728: error: implicit declaration of function 'slab_ksize' slab_ksize() needs to go out of CONFIG_SLUB_DEBUG section. Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
b3d41885 |
|
14-Feb-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: fix kmemcheck calls to match ksize() hints Recent use of ksize() in network stack (commit ca44ac38 : net: don't reallocate skb->head unless the current one hasn't the needed extra size or is shared) triggers kmemcheck warnings, because ksize() can return more space than kmemcheck is aware of. Pekka Enberg noticed SLAB+kmemcheck is doing the right thing, while SLUB+kmemcheck doesn't. Bugzilla reference #27212 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Suggested-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> CC: Changli Gao <xiaosuo@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
63310467 |
|
20-Jan-2011 |
Christoph Lameter <cl@linux.com> |
mm: Remove support for kmem_cache_name() The last user was ext4 and Eric Sandeen removed the call in a recent patch. See the following URL for the discussion: http://marc.info/?l=linux-ext4&m=129546975702198&w=2 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
62c70bce |
|
13-Jan-2011 |
Joe Perches <joe@perches.com> |
mm: convert sprintf_symbol to %pS Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
04d94879 |
|
10-Jan-2011 |
Christoph Lameter <cl@linux.com> |
slub: Avoid use of slub_lock in show_slab_objects() The purpose of the locking is to prevent removal and additions of nodes when statistics are gathered for a slab cache. So we need to avoid racing with memory hotplug functionality. It is enough to take the memory hotplug locks there instead of the slub_lock. online_pages() currently does not acquire the memory_hotplug lock. Another patch will be submitted by the memory hotplug authors to take the memory hotplug lock and describe the uses of the memory hotplug lock to protect against adding and removal of nodes from non hotplug data structures. Cc: <stable@kernel.org> # 2.6.37 Reported-and-tested-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ccd35fb9 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
kernel: kmem_ptr_validate considered harmful This is a nasty and error prone API. It is no longer used, remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
37d57443 |
|
01-Dec-2010 |
Tero Roponen <tero.roponen@gmail.com> |
slub: Fix a crash during slabinfo -v Commit f7cb1933621bce66a77f690776a16fe3ebbc4d58 ("SLUB: Pass active and inactive redzone flags instead of boolean to debug functions") missed two instances of check_object(). This caused a lot of warnings during 'slabinfo -v' finally leading to a crash: BUG ext4_xattr: Freepointer corrupt ... BUG buffer_head: Freepointer corrupt ... BUG ext4_alloc_context: Freepointer corrupt ... ... BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35 PGD 79d78067 PUD 79e67067 PMD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/kernel/slab/:t-0000192/validate This patch fixes the problem by converting the two missed instances. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tero Roponen <tero.roponen@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8165984a |
|
01-Dec-2010 |
Tero Roponen <tero.roponen@gmail.com> |
slub: Fix a crash during slabinfo -v Commit f7cb1933621bce66a77f690776a16fe3ebbc4d58 ("SLUB: Pass active and inactive redzone flags instead of boolean to debug functions") missed two instances of check_object(). This caused a lot of warnings during 'slabinfo -v' finally leading to a crash: BUG ext4_xattr: Freepointer corrupt ... BUG buffer_head: Freepointer corrupt ... BUG ext4_alloc_context: Freepointer corrupt ... ... BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35 PGD 79d78067 PUD 79e67067 PMD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/kernel/slab/:t-0000192/validate This patch fixes the problem by converting the two missed instances. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tero Roponen <tero.roponen@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
68cee4f1 |
|
28-Oct-2010 |
Pavel Emelyanov <xemul@parallels.com> |
slub: Fix slub_lock down/up imbalance There are two places, that do not release the slub_lock. Respective bugs were introduced by sysfs changes ab4d5ed5 (slub: Enable sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 ( slub: Allow removal of slab caches during boot). Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
98072e4d |
|
28-Oct-2010 |
Pavel Emelyanov <xemul@parallels.com> |
slub: Fix slub_lock down/up imbalance There are two places, that do not release the slub_lock. Respective bugs were introduced by sysfs changes ab4d5ed5 (slub: Enable sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 ( slub: Allow removal of slab caches during boot). Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
4a92379b |
|
21-Oct-2010 |
Richard Kennedy <richard@rsk.demon.co.uk> |
slub tracing: move trace calls out of always inlined functions to reduce kernel code size Having the trace calls defined in the always inlined kmalloc functions in include/linux/slub_def.h causes a lot of code duplication as the trace functions get instantiated for each kmalloc call site. This can simply be removed by pushing the trace calls down into the functions in slub.c. On my x86_64 build this patch shrinks the code size of the kernel by approx 36K and also shrinks the code size of many modules -- too many to list here ;) size vmlinux (2.6.36) reports text data bss dec hex filename 5410611 743172 828928 6982711 6a8c37 vmlinux 5373738 744244 828928 6946910 6a005e vmlinux + patch The resulting kernel has had some testing & kmalloc trace still seems to work. This patch - moves trace_kmalloc out of the inlined kmalloc() and pushes it down into kmem_cache_alloc_trace() so that it only gets instantiated once. - renames kmem_cache_alloc_notrace() to kmem_cache_alloc_trace() to indicate that it now does have tracing. (maybe this would be better called something like kmalloc_kmem_cache ?) - adds a new function kmalloc_order() to handle allocation and tracing of large allocations of page order. - removes tracing from the inlined kmalloc_large() replacing it with a call to kmalloc_order(); - moves tracing out of inlined kmalloc_node() and pushes it down into kmem_cache_alloc_node_trace - renames kmem_cache_alloc_node_notrace() to kmem_cache_alloc_node_trace() - removes the include of trace/events/kmem.h from slub_def.h. v2 - keep kmalloc_order_trace inline when !CONFIG_TRACE Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
92a5bbc1 |
|
06-Oct-2010 |
Pekka Enberg <penberg@kernel.org> |
SLUB: Fix memory hotplug with !NUMA This patch fixes the following build breakage when memory hotplug is enabled on UMA configurations: /home/test/linux-2.6/mm/slub.c: In function 'kmem_cache_init': /home/test/linux-2.6/mm/slub.c:3031:2: error: 'slab_memory_callback' undeclared (first use in this function) /home/test/linux-2.6/mm/slub.c:3031:2: note: each undeclared identifier is reported only once for each function it appears in make[2]: *** [mm/slub.o] Error 1 make[1]: *** [mm] Error 2 make: *** [sub-make] Error 2 Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a5a84755 |
|
05-Oct-2010 |
Christoph Lameter <cl@linux.com> |
slub: Move functions to reduce #ifdefs There is a lot of #ifdef/#endifs that can be avoided if functions would be in different places. Move them around and reduce #ifdef. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ab4d5ed5 |
|
05-Oct-2010 |
Christoph Lameter <cl@linux.com> |
slub: Enable sysfs support for !CONFIG_SLUB_DEBUG Currently disabling CONFIG_SLUB_DEBUG also disabled SYSFS support meaning that the slabs cannot be tuned without DEBUG. Make SYSFS support independent of CONFIG_SLUB_DEBUG Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
15b7c514 |
|
02-Oct-2010 |
Pekka Enberg <penberg@kernel.org> |
SLUB: Optimize slab_free() debug check This patch optimizes slab_free() debug check to use "c->node != NUMA_NO_NODE" instead of "c->node >= 0" because the former generates smaller code on x86-64: Before: 4736: 48 39 70 08 cmp %rsi,0x8(%rax) 473a: 75 26 jne 4762 <kfree+0xa2> 473c: 44 8b 48 10 mov 0x10(%rax),%r9d 4740: 45 85 c9 test %r9d,%r9d 4743: 78 1d js 4762 <kfree+0xa2> After: 4736: 48 39 70 08 cmp %rsi,0x8(%rax) 473a: 75 23 jne 475f <kfree+0x9f> 473c: 83 78 10 ff cmpl $0xffffffffffffffff,0x10(%rax) 4740: 74 1d je 475f <kfree+0x9f> This patch also cleans up __slab_alloc() to use NUMA_NO_NODE instead of "-1" for enabling debugging for a per-CPU cache. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
5d1f57e4 |
|
29-Sep-2010 |
Namhyung Kim <namhyung@gmail.com> |
slub: Move NUMA-related functions under CONFIG_NUMA Make kmalloc_cache_alloc_node_notrace(), kmalloc_large_node() and __kmalloc_node_track_caller() to be compiled only when CONFIG_NUMA is selected. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
3478973d |
|
29-Sep-2010 |
Namhyung Kim <namhyung@gmail.com> |
slub: Add lock release annotation The unfreeze_slab() releases page's PG_locked bit but was missing proper annotation. The deactivate_slab() needs to be marked also since it calls unfreeze_slab() without grabbing the lock. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a5dd5c11 |
|
29-Sep-2010 |
Namhyung Kim <namhyung@gmail.com> |
slub: Fix signedness warnings The bit-ops routines require its arg to be a pointer to unsigned long. This leads sparse to complain about different signedness as follows: mm/slub.c:2425:49: warning: incorrect type in argument 2 (different signedness) mm/slub.c:2425:49: expected unsigned long volatile *addr mm/slub.c:2425:49: got long *map Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
62e346a8 |
|
28-Sep-2010 |
Christoph Lameter <cl@linux.com> |
slub: extract common code to remove objects from partial list without locking There are a couple of places where we repeat the same statements when removing a page from the partial list. Consolidate that into __remove_partial(). Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
f7cb1933 |
|
29-Sep-2010 |
Christoph Lameter <cl@linux.com> |
SLUB: Pass active and inactive redzone flags instead of boolean to debug functions Pass the actual values used for inactive and active redzoning to the functions that check the objects. Avoids a lot of the ? : things to lookup the values in the functions. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
7340cc84 |
|
28-Sep-2010 |
Christoph Lameter <cl@linux.com> |
slub: reduce differences between SMP and NUMA Reduce the #ifdefs and simplify bootstrap by making SMP and NUMA as much alike as possible. This means that there will be an additional indirection to get to the kmem_cache_node field under SMP. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
ed59ecbf |
|
18-Sep-2010 |
Pekka Enberg <penberg@kernel.org> |
Revert "Slub: UP bandaid" This reverts commit 5249d039500f05a5ab379286b1d23ab9b04d3f2c. It's not needed after commit bbddff0545878a8649c091a9dd7c43ce91516734 ("percpu: use percpu allocator on UP too").
|
#
84c1cf62 |
|
14-Sep-2010 |
Pekka Enberg <penberg@kernel.org> |
SLUB: Fix merged slab cache names

As explained by Linus "I'm Proud to be an American" Torvalds:

  Looking at the merging code, I actually think it's totally buggy. If you have something like this:

   - load module A: create slab cache A
   - load module B: create slab cache B that can merge with A
   - unload module A
   - "cat /proc/slabinfo": BOOM. Oops.

  exactly because the name is not handled correctly, and you'll have module B holding open a slab cache that has a name pointer that points to module A that no longer exists.

This patch fixes the problem by using kstrdup() to allocate dynamic memory for ->name of "struct kmem_cache" as suggested by Christoph Lameter.

Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>

Conflicts: mm/slub.c
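A minimal user-space sketch of the idea behind the fix: the cache keeps its own copy of the name instead of pointing into storage owned by whoever created it. `struct cache_sketch` and both functions are hypothetical; malloc() and memcpy() stand in for the kernel's kstrdup().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct kmem_cache. */
struct cache_sketch {
    char *name; /* owned copy, never a pointer into the creator's storage */
};

/* Returns 0 on success, -1 if the copy could not be allocated. */
static inline int cache_sketch_init(struct cache_sketch *s, const char *name)
{
    size_t len = strlen(name) + 1;

    s->name = malloc(len);
    if (!s->name)
        return -1;
    memcpy(s->name, name, len);
    return 0;
}

/* Releases the owned copy of the name. */
static inline void cache_sketch_destroy(struct cache_sketch *s)
{
    free(s->name);
    s->name = NULL;
}
```

Because the copy is owned by the cache, the name stays valid even after the creator's string is gone, which is the situation in the module A / module B scenario.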
|
#
db210e70 |
|
26-Aug-2010 |
Christoph Lameter <cl@linux.com> |
Slub: UP bandaid Since the percpu allocator does not provide early allocation in UP mode (only in SMP configurations) use __get_free_page() to improvise a compound page allocation that can be later freed via kfree(). Compound pages will be released when the cpu caches are resized. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
a016471a |
|
25-Aug-2010 |
David Rientjes <rientjes@google.com> |
slub: fix SLUB_RESILIENCY_TEST for dynamic kmalloc caches Now that the kmalloc_caches array is dynamically allocated at boot, SLUB_RESILIENCY_TEST needs to be fixed to pass the correct type. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
8de66a0c |
|
25-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Fix up missing kmalloc_cache -> kmem_cache_node case for memoryhotplug Memory hotplug allocates and frees per node structures. Use the correct name. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
7d550c56 |
|
25-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Add dummy functions for the !SLUB_DEBUG case

On Wed, 25 Aug 2010, Randy Dunlap wrote:

> mm/slub.c:1732: error: implicit declaration of function 'slab_pre_alloc_hook'
> mm/slub.c:1751: error: implicit declaration of function 'slab_post_alloc_hook'
> mm/slub.c:1881: error: implicit declaration of function 'slab_free_hook'
> mm/slub.c:1886: error: implicit declaration of function 'slab_free_hook_irq'

Empty functions are missing if the runtime debuggability option is compiled out. Provide the fall back functions to empty hooks if SLUB_DEBUG is not set.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c1d50836 |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Move gfpflag masking out of the hotpath Move the gfpflags masking into the hooks for checkers and into the slowpaths. gfpflag masking requires access to a global variable and thus adds an additional cacheline reference to the hotpaths. If no hooks are active then the gfpflag masking will result in code that the compiler can toss out. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
c016b0bd |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Extract hooks for memory checkers from hotpaths Extract the code that memory checkers and other verification tools use from the hotpaths. Makes it easier to add new ones and reduces the disturbances of the hotpaths. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
51df1142 |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Dynamically size kmalloc cache allocations kmalloc caches are statically defined and may take up a lot of space just because the size of the node array has to be dimensioned for the largest node count supported. This patch makes the size of the kmem_cache structure dynamic throughout by creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap occurs by allocating the initial one or two kmem_cache objects from the page allocator. C2->C3 - Fix various issues indicated by David - Make create kmalloc_cache return a kmem_cache * pointer. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
6c182dc0 |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Remove static kmem_cache_cpu array for boot The percpu allocator can now handle allocations during early boot. So drop the static kmem_cache_cpu array. Cc: Tejun Heo <tj@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
55136592 |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Remove dynamic dma slab allocation Remove the dynamic dma slab allocation since this causes too many issues with nested locks etc etc. The change avoids passing gfpflags into many functions. V3->V4: - Create dma caches in kmem_cache_init() instead of kmem_cache_init_late(). Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
1537066c |
|
19-Aug-2010 |
Christoph Lameter <cl@linux.com> |
slub: Force no inlining of debug functions The compiler folds the debugging functions into the critical paths. Avoid that by adding noinline to the functions that check for problems. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
#
2bce6485 |
|
19-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Allow removal of slab caches during boot Serialize kmem_cache_create and kmem_cache_destroy using the slub_lock. Only possible after the use of the slub_lock during dynamic dma creation has been removed. Then make sure that the setup of the slab sysfs entries does not race with kmem_cache_create and kmem_cache_destroy. If a slab cache is removed before we have setup sysfs then simply skip over the sysfs handling. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Roland Dreier <rdreier@cisco.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
e438444d |
|
02-Aug-2010 |
Pekka Enberg <penberg@cs.helsinki.fi> |
Revert "slub: Allow removal of slab caches during boot" This reverts commit f5b801ac38a9612b380ee9a75ab1861f0594e79f.
|
#
bc6488e9 |
|
26-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub numa: Fix rare allocation from unexpected node The network developers have seen sporadic allocations resulting in objects coming from unexpected NUMA nodes despite asking for objects from a specific node. This is due to get_partial() calling get_any_partial() if partial slabs are exhausted for a node even if a node was specified and therefore one would expect allocations only from the specified node. get_any_partial() sporadically may return a slab from a foreign node to gradually reduce the size of partial lists on remote nodes and thereby reduce total memory use for a slab cache. The behavior is controlled by the remote_defrag_ratio of each cache. Strictly speaking this is permitted behavior since __GFP_THISNODE was not specified for the allocation but it is certainly surprising. This patch makes sure that the remote defrag behavior only occurs if no node was specified. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
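The node-selection rule described above can be sketched as a small user-space model. Everything here is hypothetical (`struct partial_sketch`, `take_partial()`, the two-node limit); it models only the decision of which node's partial list is used, not the real list handling or the remote_defrag_ratio logic.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define SKETCH_NODES 2

/* Hypothetical model: partial[n] counts the partial slabs on node n. */
struct partial_sketch {
    int partial[SKETCH_NODES];
    int local_node;
};

/* Returns the node a partial slab was taken from, or -1 if none was taken. */
static inline int take_partial(struct partial_sketch *p, int node)
{
    int search = (node == NUMA_NO_NODE) ? p->local_node : node;

    if (p->partial[search] > 0) {
        p->partial[search]--;
        return search;
    }

    /* An explicit node request never falls back to a foreign node. */
    if (node != NUMA_NO_NODE)
        return -1;

    /* No node was requested, so any node with a partial slab will do. */
    for (int n = 0; n < SKETCH_NODES; n++) {
        if (p->partial[n] > 0) {
            p->partial[n]--;
            return n;
        }
    }
    return -1;
}
```

In this model a -1 result simply means no partial slab was handed out; what the allocator does next is outside the sketch.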
|
#
af537b0a |
|
09-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Use kmem_cache flags to detect if slab is in debugging mode. The cacheline with the flags is reachable from the hot paths after the percpu allocator changes went in. So there is no need anymore to put a flag into each slab page. Get rid of the SlubDebug flag and use the flags in kmem_cache instead. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
f5b801ac |
|
09-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Allow removal of slab caches during boot If a slab cache is removed before we have setup sysfs then simply skip over the sysfs handling. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Roland Dreier <rdreier@cisco.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
d7278bd7 |
|
09-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Check kasprintf results in kmem_cache_init() Small allocations may fail during slab bringup which is fatal. Add a BUG_ON() so that we fail immediately rather than failing later during sysfs processing. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
f90ec390 |
|
09-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Constants need UL UL suffix is missing in some constants. Conform to how slab.h uses constants. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
2154a336 |
|
09-Jul-2010 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Use a constant for a unspecified node. kmalloc_node() and friends can be passed a constant -1 to indicate that no choice was made for the node from which the object needs to come. Use NUMA_NO_NODE instead of -1. CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
039ca4e7 |
|
26-May-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
tracing: Remove kmemtrace ftrace plugin We have been resisting new ftrace plugins and removing existing ones, and kmemtrace has been superseded by kmem trace events and perf-kmem, so we remove it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> [ remove kmemtrace from the makefile, handle slob too ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
#
c0ff7453 |
|
24-May-2010 |
Miao Xie <miaox@cn.fujitsu.com> |
cpuset,mm: fix no node to alloc memory when changing cpuset's mems

Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in this way, the allocator may find that there is no node to alloc memory.

The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocator can alloc pages on, for example:

(mpol: mempolicy)

    task1                    task1's mpol    task2
    alloc page               1
      alloc on node0? NO     1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1               set new bits
                             0                 clear disallowed bits
      alloc on node1? NO     0
      ...
    can't alloc page
      goto oom

This patch fixes this problem by expanding the nodes range first (set newly allowed bits) and shrinking it lazily (clear newly disallowed bits). So we use a variable to tell the write-side task that the read-side task is reading the nodemask, and the write-side task clears newly disallowed nodes after the read-side task ends the current memory allocation.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
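The expand-then-shrink ordering can be illustrated with a plain bitmask, one bit per node. This is a sketch under strong assumptions: `struct mems_sketch` and both helpers are hypothetical, and the reader-tracking variable mentioned in the message is left out, so the caller is assumed to invoke the shrink step only when no allocation is in flight.

```c
#include <assert.h>

/* Hypothetical model of an allowed-nodes mask: bit n set means node n is allowed. */
struct mems_sketch {
    unsigned long allowed;
};

/* Step 1: expand. Newly allowed bits are set; previously allowed bits stay set. */
static inline void mems_expand(struct mems_sketch *m, unsigned long newmask)
{
    m->allowed |= newmask;
}

/* Step 2: shrink. Bits outside the new mask are cleared once readers are done. */
static inline void mems_shrink(struct mems_sketch *m, unsigned long newmask)
{
    m->allowed &= newmask;
}
```

Between the two steps the mask is the union of the old and new sets, so a concurrent reader always finds at least one allowed node.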
|
#
73367bd8 |
|
21-May-2010 |
Alexander Duyck <alexander.h.duyck@intel.com> |
slub: move kmem_cache_node into it's own cacheline This patch is meant to improve the performance of SLUB by moving the local kmem_cache_node lock into its own cacheline separate from kmem_cache. This is accomplished by simply removing the local_node when NUMA is enabled. On my system with 2 nodes I saw around a 5% performance increase w/ hackbench times dropping from 6.2 seconds to 5.9 seconds on average. I suspect the performance gain would increase as the number of nodes increases, but I do not have the data to currently back that up. Bugzilla-Reference: http://bugzilla.kernel.org/show_bug.cgi?id=15713 Cc: <stable@kernel.org> Reported-by: Alex Shi <alex.shi@intel.com> Tested-by: Alex Shi <alex.shi@intel.com> Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
6b65aaf3 |
|
14-Apr-2010 |
Minchan Kim <minchan.kim@gmail.com> |
slub: Use alloc_pages_exact_node() for page allocation The alloc_slab_page() in SLUB uses alloc_pages() if node is '-1'. This means that node validity check in alloc_pages_node is unnecessary and we can use alloc_pages_exact_node() to avoid comparison and branch as commit 6484eb3e2a81807722 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") did for the page allocator. Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
d3e14aa3 |
|
08-Apr-2010 |
Xiaotian Feng <dfeng@redhat.com> |
slub: __kmalloc_node_track_caller should trace kmalloc_large_node case commit 94b528d (kmemtrace: SLUB hooks for caller-tracking functions) missed tracing kmalloc_large_node in __kmalloc_node_track_caller. We should trace it same as __kmalloc_node. Acked-by: David Rientjes <rientjes@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
bbd7d57b |
|
24-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: Potential stack overflow I discovered that we can overflow stack if CONFIG_SLUB_DEBUG=y and use slabs with many objects, since list_slab_objects() and process_slab() use DECLARE_BITMAP(map, page->objects). With 65535 bits, we use 8192 bytes of stack ... Switch these allocations to dynamic allocations. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
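The 8192-byte figure follows from rounding the bit count up to whole `unsigned long` words. The sketch below reproduces that arithmetic in user space and shows a heap allocation in place of the on-stack bitmap; the `SKETCH_`-prefixed macros and both function names are hypothetical, and calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
    (((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Bytes an array-of-unsigned-long bitmap with nbits bits would occupy. */
static inline size_t bitmap_bytes(size_t nbits)
{
    return SKETCH_BITS_TO_LONGS(nbits) * sizeof(unsigned long);
}

/* Heap-allocated, zeroed bitmap; the caller releases it with free(). */
static inline unsigned long *bitmap_alloc_sketch(size_t nbits)
{
    return calloc(SKETCH_BITS_TO_LONGS(nbits), sizeof(unsigned long));
}
```

With 65535 bits this rounds up to 65536 bits, that is 8192 bytes, whether `unsigned long` is 32 or 64 bits wide.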
|
#
4581ced3 |
|
18-May-2010 |
David Woodhouse <dwmw2@infradead.org> |
mm: Move ARCH_SLAB_MINALIGN and ARCH_KMALLOC_MINALIGN to <linux/slub_def.h> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
111c7d82 |
|
01-Apr-2010 |
Zhang, Yanmin <yanmin_zhang@linux.intel.com> |
slub: Fix bad boundary check in init_kmem_cache_nodes() Function init_kmem_cache_nodes is incorrect when checking upper limitation of kmalloc_caches. The breakage was introduced by commit 91efd773c74bb26b5409c85ad755d536448e229c ("dma kmalloc handling fixes"). Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
d3e06e2b |
|
07-Apr-2010 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: Fix kmem_ptr_validate() for non-kernel pointers As suggested by Linus, fix up kmem_ptr_validate() to handle non-kernel pointers more graciously. The patch changes kmem_ptr_validate() to use the newly introduced kern_ptr_validate() helper to check that a pointer is a valid kernel pointer before we attempt to convert it into a 'struct page'. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52cf25d0 |
|
18-Jan-2010 |
Emese Revfy <re.emese@gmail.com> |
Driver core: Constify struct sysfs_ops in struct kobj_type

Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared (referenced) by many other structure instances at runtime
 * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime
 * potentially better optimized code as the compiler can assume that the const data cannot be changed
 * the compiler/linker move const data into .rodata and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
9cd43611 |
|
31-Dec-2009 |
Emese Revfy <re.emese@gmail.com> |
kobject: Constify struct kset_uevent_ops

Constify struct kset_uevent_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared (referenced) by many other structure instances at runtime
 * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime
 * potentially better optimized code as the compiler can assume that the const data cannot be changed
 * the compiler/linker move const data into .rodata and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
1154fab7 |
|
28-Feb-2010 |
Stephen Rothwell <sfr@canb.auug.org.au> |
SLUB: Fix per-cpu merge conflict

The slab tree adds a percpu variable usage case (commit 9dfc6e68bfe6ee452efb1a4e9ca26a9007f2b864 "SLUB: Use this_cpu operations in slub"), but the percpu tree removes the prefixing of percpu variables (commit dd17c8f72993f9461e9c19250e3f155d6d99df22 "percpu: remove per_cpu__ prefix"), thus causing the following compilation error:

    CC      mm/slub.o
    mm/slub.c: In function ‘alloc_kmem_cache_cpus’:
    mm/slub.c:2078: error: implicit declaration of function ‘per_cpu_var’
    mm/slub.c:2078: warning: assignment makes pointer from integer without a cast
    make[1]: *** [mm/slub.o] Error 1

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
4c13dd3b |
|
25-Feb-2010 |
Dmitry Monakhov <dmonakhov@openvz.org> |
failslab: add ability to filter slab caches

This patch allows injecting faults only for specific slabs. To preserve the default behavior, the cache filter is off by default (all caches are faulty).

One may define a specific set of slabs like this:

    # mark skbuff_head_cache as faulty
    echo 1 > /sys/kernel/slab/skbuff_head_cache/failslab
    # Turn on cache filter (off by default)
    echo 1 > /sys/kernel/debug/failslab/cache-filter
    # Turn on fault injection
    echo 1 > /sys/kernel/debug/failslab/times
    echo 1 > /sys/kernel/debug/failslab/probability

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
c9404c9c |
|
18-Dec-2009 |
Adam Buchbinder <adam.buchbinder@gmail.com> |
Fix misspelling of "should" and "shouldn't" in comments. Some comments misspell "should" or "shouldn't"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
91efd773 |
|
21-Jan-2010 |
Christoph Lameter <cl@linux-foundation.org> |
dma kmalloc handling fixes 1. We need kmalloc_percpu for all of the now extended kmalloc caches array not just for each shift value. 2. init_kmem_cache_nodes() must assume node 0 locality for statically allocated dma kmem_cache structures even after boot is complete. Reported-and-tested-by: Alex Chiang <achiang@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7738dd9e |
|
15-Jan-2010 |
David Rientjes <rientjes@google.com> |
slub: remove impossible condition `s' cannot be NULL if kmalloc_caches is not NULL. This conditional would trigger a NULL pointer dereference on `s' anyway, since it is immediately dereferenced if true. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
84e554e6 |
|
18-Dec-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Make slub statistics use this_cpu_inc this_cpu_inc() translates into a single instruction on x86 and does not need any register. So use it in stat(). We also want to avoid the calculation of the per cpu kmem_cache_cpu structure pointer. So pass a kmem_cache pointer instead of a kmem_cache_cpu pointer. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
ff12059e |
|
18-Dec-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: this_cpu: Remove slub kmem_cache fields Remove the fields in struct kmem_cache_cpu that were used to cache data from struct kmem_cache when they were in different cachelines. The cacheline that holds the per cpu array pointer now also holds these values. We can cut down the struct kmem_cache_cpu size to almost half. The get_freepointer() and set_freepointer() functions that used to be only intended for the slow path now are also useful for the hot path since access to the size field does not require accessing an additional cacheline anymore. This results in consistent use of functions for setting the freepointer of objects throughout SLUB. Also we initialize all possible kmem_cache_cpu structures when a slab is created. No need to initialize them when a processor or node comes online. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
756dee75 |
|
18-Dec-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Get rid of dynamic DMA kmalloc cache allocation Dynamic DMA kmalloc cache allocation is troublesome since the new percpu allocator does not support allocations in atomic contexts. Reserve some statically allocated kmalloc_cpu structures instead. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
9dfc6e68 |
|
18-Dec-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Use this_cpu operations in slub

Using per cpu allocations removes the need for the per cpu arrays in the kmem_cache struct. These could get quite big if we have to support systems with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configuration shrinks since we will only need 1 pointer instead of NR_CPUS. The same pointer can be used by all processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the system meaning less memory overhead for configurations that may potentially support up to 1k NUMA nodes / 4k cpus.

3. We can remove the diddle widdle with allocating and releasing of kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu alloc logic will do it all for us. Removes some portions of the cpu hotplug functionality.

4. Fastpath performance increases since per cpu pointer lookups and address calculations are avoided.

V7-V8
- Convert missed get_cpu_slab() under CONFIG_SLUB_STATS

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
0f24f128 |
|
11-Dec-2009 |
Li Zefan <lizf@cn.fujitsu.com> |
tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACING Define kmem_trace_alloc_{,node}_notrace() if CONFIG_TRACING is enabled, otherwise perf-kmem will show wrong stats ifndef CONFIG_KMEM_TRACE, because a kmalloc() memory allocation may be traced by both trace_kmalloc() and trace_kmem_cache_alloc(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: linux-mm@kvack.org <linux-mm@kvack.org> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <4B21F89A.7000801@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
74e2134f |
|
25-Nov-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Fix __GFP_ZERO unlikely() annotation

The unlikely() annotation in slab_alloc() covers too much of the expression. It's actually very likely that the object is not NULL so use unlikely() only for the __GFP_ZERO expression like SLAB does.

The patch reduces kernel text by 29 bytes on x86-64:

       text    data     bss     dec     hex filename
      24185    8560     176   32921    8099 mm/slub.o.orig
      24156    8560     176   32892    807c mm/slub.o

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
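A user-space sketch of the narrowed annotation, assuming a GCC- or Clang-compatible compiler for `__builtin_expect`. The macro `sketch_unlikely`, the flag value `SKETCH_GFP_ZERO`, and `maybe_zero()` are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <string.h>

/* Branch hint in the style of the kernel's unlikely(); compiler-specific. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical flag bit standing in for __GFP_ZERO. */
#define SKETCH_GFP_ZERO 0x1u

/*
 * Only the flag test is marked unlikely; the "object is non-NULL" test is
 * left unannotated because a successful allocation is the common case.
 */
static inline void maybe_zero(unsigned int flags, void *object, size_t size)
{
    if (sketch_unlikely(flags & SKETCH_GFP_ZERO) && object)
        memset(object, 0, size);
}
```

The hint changes only how the compiler lays out the branch; the function behaves the same with or without it.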
|
#
78eb00cc |
|
15-Oct-2009 |
David Rientjes <rientjes@google.com> |
slub: allow stats to be cleared When collecting slub stats for particular workloads, it's necessary to collect each statistic for all caches before the job is even started because the counters are usually greater than zero just from boot and initialization. This allows a statistic to be cleared on each cpu by writing '0' to its sysfs file. This creates a baseline for statistics of interest before the workload is started. Setting a statistic to a particular value is not supported, so all values written to these files other than '0' returns -EINVAL. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
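The write semantics described above (a leading '0' clears the counter on every CPU, anything else is rejected) can be sketched as a small store handler. `struct stat_sketch`, `stat_store()`, and the fixed CPU count are hypothetical; this models the behavior, not the sysfs plumbing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_CPUS 4

/* Hypothetical per-CPU counters for a single statistic. */
struct stat_sketch {
    unsigned long per_cpu[SKETCH_CPUS];
};

/* Store handler model: returns the number of bytes consumed, or -EINVAL. */
static inline long stat_store(struct stat_sketch *st, const char *buf, size_t len)
{
    /* Only clearing is supported; setting an arbitrary value is refused. */
    if (len == 0 || buf[0] != '0')
        return -EINVAL;

    for (int cpu = 0; cpu < SKETCH_CPUS; cpu++)
        st->per_cpu[cpu] = 0;

    return (long)len;
}
```

A rejected write leaves every counter untouched, so a mistaken value cannot disturb a measurement in progress.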
|
#
fe1ff49d |
|
21-Sep-2009 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
mm: kmem_cache_create(): make it easier to catch NULL cache names Right now, if you inadvertently pass NULL to kmem_cache_create() at boot time, it crashes much later after boot somewhere deep inside sysfs which makes it very non obvious to figure out what's going on. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fdaa45e9 |
|
15-Sep-2009 |
Ingo Molnar <mingo@elte.hu> |
slub: Fix build error in kmem_cache_open() with !CONFIG_SLUB_DEBUG

This build bug:

    mm/slub.c: In function 'kmem_cache_open':
    mm/slub.c:2476: error: 'disable_higher_order_debug' undeclared (first use in this function)
    mm/slub.c:2476: error: (Each undeclared identifier is reported only once
    mm/slub.c:2476: error: for each function it appears in.)

Triggers because there's no !CONFIG_SLUB_DEBUG definition for disable_higher_order_debug.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
8a3d271d |
|
03-Sep-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: fix slab_pad_check() When SLAB_POISON is used and slab_pad_check() finds an overwrite of the slab padding, we call restore_bytes() on the whole slab, not only on the padding. Acked-by: Christoph Lameter <cl@linux-foundation.org> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
d76b1590 |
|
03-Sep-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close() and *before* sysfs_slab_remove() or risk rcu_free_slab() being called after kmem_cache is deleted (kfreed). rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache. Cc: <stable@kernel.org> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
5788d8ad |
|
21-Jul-2009 |
Xiaotian Feng <dfeng@redhat.com> |
slub: release kobject if sysfs_create_group failed in sysfs_slab_add When CONFIG_SLUB_DEBUG is enabled, sysfs_slab_add should unlink and put the kobject if sysfs_create_group failed. Otherwise, sysfs_slab_add returns error then free kmem_cache s, thus memory of s->kobj is leaked. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
acdfcd04 |
|
28-Aug-2009 |
Aaro Koskinen <aaro.koskinen@nokia.com> |
SLUB: fix ARCH_KMALLOC_MINALIGN cases 64 and 256 If the minalign is 64 bytes, then the 96 byte cache should not be created because it would conflict with the 128 byte cache. If the minalign is 256 bytes, patching the size_index table should not result in a buffer overrun. The calculation "(i - 1) / 8" used to access size_index[] is moved to a separate function as suggested by Christoph Lameter. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
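Both points in the message can be sketched with two small helpers. The index calculation "(i - 1) / 8" is taken from the message; `size_index_elem()` here is a user-space illustration of it, and `align_up_sketch()` is a hypothetical helper showing why a 96-byte cache cannot coexist with a 64-byte minimum alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Maps a request size in bytes (>= 1) to a slot in a table with 8-byte steps. */
static inline size_t size_index_elem(size_t bytes)
{
    return (bytes - 1) / 8;
}

/* Rounds size up to the next multiple of align (align must be a power of two). */
static inline size_t align_up_sketch(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}
```

With a 64-byte minimum alignment, 96 bytes rounds up to 128, so objects in a 96-byte cache would occupy the same space as those in the 128-byte cache.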
|
#
5086c389c |
|
19-Aug-2009 |
Amerigo Wang <amwang@redhat.com> |
SLUB: Fix some coding style issues Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
cf5d1131 |
|
18-Aug-2009 |
WANG Cong <amwang@redhat.com> |
SLUB: Drop write permission to /proc/slabinfo SLUB does not support writes to /proc/slabinfo so there should not be write permission to do that either. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
dcb0ce1b |
|
29-Jul-2009 |
Zhang, Yanmin <yanmin_zhang@linux.intel.com> |
slub: change kmem_cache->align to record the real alignment kmem_cache->align records the original align parameter value specified by users. Function calculate_alignment might change it based on cache line size. So change kmem_cache->align correspondingly. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
3de47213 |
|
27-Jul-2009 |
David Rientjes <rientjes@google.com> |
slub: use size and objsize orders to disable debug flags This patch moves the masking of debugging flags which increase a cache's min order due to metadata when `slub_debug=O' is used from kmem_cache_flags() to kmem_cache_open(). Instead of defining the maximum metadata size increase in a preprocessor macro, this approach uses the cache's ->size and ->objsize members to determine if the min order increased due to debugging options. If so, the flags specified in the more appropriately named DEBUG_METADATA_FLAGS are masked off. This approach was suggested by Christoph Lameter <cl@linux-foundation.org>. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
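The order comparison described here can be sketched in user space. This is an illustration only: the flag bits, the 4K page size and the helper names `order_for` and `mask_debug_flags` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL /* assumed 4K pages */

/* Hypothetical flag bits standing in for the real SLAB_* debug flags. */
#define DBG_RED_ZONE   0x1UL
#define DBG_POISON     0x2UL
#define DBG_STORE_USER 0x4UL
#define DBG_METADATA_FLAGS (DBG_RED_ZONE | DBG_POISON | DBG_STORE_USER)

/* Smallest order such that (PAGE_SIZE_BYTES << order) >= size. */
static int order_for(size_t size)
{
	int order = 0;

	while ((PAGE_SIZE_BYTES << order) < size)
		order++;
	return order;
}

/*
 * objsize is the payload size, size is the payload plus debug metadata.
 * If the metadata alone pushes the minimum order up, drop the flags
 * that add metadata and leave all other flags untouched.
 */
static unsigned long mask_debug_flags(unsigned long flags,
				      size_t objsize, size_t size)
{
	if (order_for(size) > order_for(objsize))
		flags &= ~DBG_METADATA_FLAGS;
	return flags;
}
```

With 4K pages, a 4096-byte object fits in order 0 while the same object with 72 bytes of metadata needs order 1, so the metadata flags are masked off; a 256-byte object is unaffected.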
|
#
fa5ec8a1 |
|
07-Jul-2009 |
David Rientjes <rientjes@google.com> |
slub: add option to disable higher order debugging slabs When debugging is enabled, slub requires that additional metadata be stored in slabs for certain options: SLAB_RED_ZONE, SLAB_POISON, and SLAB_STORE_USER. Consequently, it may require that the minimum possible slab order needed to allocate a single object be greater when using these options. The most notable example is for objects that are PAGE_SIZE bytes in size. Higher minimum slab orders may cause page allocation failures when oom or under heavy fragmentation. This patch adds a new slub_debug option, which disables debugging by default for caches that would have resulted in higher minimum orders: slub_debug=O When this option is used on systems with 4K pages, kmalloc-4096, for example, will not have debugging enabled by default even if CONFIG_SLUB_DEBUG_ON is defined because it would have resulted in a order-1 minimum slab order. Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
e4f7c0b4 |
|
07-Jul-2009 |
Catalin Marinas <catalin.marinas@arm.com> |
kmemleak: Trace the kmalloc_large* functions in slub The kmalloc_large() and kmalloc_large_node() functions were missed when adding the kmemleak hooks to the slub allocator. However, they should be traced to avoid false positives. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7ed9f7e5 |
|
25-Jun-2009 |
Paul E. McKenney <paulmck@kernel.org> |
fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b Jesper noted that kmem_cache_destroy() invokes synchronize_rcu() rather than rcu_barrier() in the SLAB_DESTROY_BY_RCU case, which could result in RCU callbacks accessing a kmem_cache after it had been destroyed. Cc: <stable@kernel.org> Acked-by: Matt Mackall <mpm@selenic.com> Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
ba52270d |
|
24-Jun-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Don't pass __GFP_FAIL for the initial allocation SLUB uses higher order allocations by default but falls back to small orders under memory pressure. Make sure the GFP mask used in the initial allocation doesn't include __GFP_NOFAIL. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
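The two-step allocation pattern described here can be sketched as follows. This is a simplified model, assuming a stub page allocator and a made-up bit value for the no-fail flag; `stub_alloc`, `alloc_slab_order` and `GFP_NOFAIL_BIT` are hypothetical names.

```c
/* Hypothetical bit; the real __GFP_NOFAIL value is not assumed here. */
#define GFP_NOFAIL_BIT 0x800u

/* Stub page allocator: succeeds only for orders up to max_ok_order. */
static int stub_alloc(unsigned int gfp, int order, int max_ok_order)
{
	(void)gfp; /* a real allocator would honour the mask */
	return order <= max_ok_order;
}

/* Return the order that was allocated, or -1 if both attempts failed. */
static int alloc_slab_order(unsigned int gfp, int pref_order,
			    int min_order, int max_ok_order)
{
	/* First attempt: preferred (higher) order, and it is allowed to fail. */
	if (stub_alloc(gfp & ~GFP_NOFAIL_BIT, pref_order, max_ok_order))
		return pref_order;

	/* Fallback: minimum order, using the caller's original mask. */
	if (stub_alloc(gfp, min_order, max_ok_order))
		return min_order;

	return -1;
}
```

The point of stripping the flag on the first attempt is that the higher order is only an optimisation, so a failure there should fall through to the smaller order rather than make the page allocator retry indefinitely.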
|
#
204fba4a |
|
24-Jun-2009 |
Tejun Heo <tj@kernel.org> |
percpu: cleanup percpu array definitions Currently, the following three different ways to define percpu arrays are in use. 1. DEFINE_PER_CPU(elem_type[array_len], array_name); 2. DEFINE_PER_CPU(elem_type, array_name[array_len]); 3. DEFINE_PER_CPU(elem_type, array_name)[array_len]; Unify to #1 which correctly separates the roles of the two parameters and thus allows more flexibility in the way percpu variables are defined. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm@kvack.org Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net>
|
#
dcce284a |
|
17-Jun-2009 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
mm: Extend gfp masking to the page allocator The page allocator also needs the masking of gfp flags during boot, so this moves it out of slab/slub and uses it with the page allocator as well. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
62bc62a8 |
|
16-Jun-2009 |
Christoph Lameter <cl@linux-foundation.org> |
page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths num_online_nodes() is called in a number of places but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function and touches a number of cache lines. This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes(). [rientjes@google.com: do not override definition of node_set_online() with macro] Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b1eeab67 |
|
25-Nov-2008 |
Vegard Nossum <vegard.nossum@gmail.com> |
kmemcheck: add hooks for the page allocator This adds support for tracking the initializedness of memory that was allocated with the page allocator. Highmem requests are not tracked. Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> [build fix for !CONFIG_KMEMCHECK] Signed-off-by: Ingo Molnar <mingo@elte.hu> [rebased for mainline inclusion] Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
|
#
964cf35c |
|
15-Jun-2009 |
Nick Piggin <npiggin@suse.de> |
SLUB: Fix early boot GFP_DMA allocations Recent change to use slab allocations earlier exposed a bug where SLUB can call schedule_work and try to call sysfs before it is safe to do so. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
5a896d9e |
|
03-Apr-2008 |
Vegard Nossum <vegard.nossum@gmail.com> |
slub: add hooks for kmemcheck Parts of this patch were contributed by Pekka Enberg but merged for atomicity. Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> [rebased for mainline inclusion] Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
|
#
95f85989 |
|
11-Jun-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Don't print out OOM warning for __GFP_NOFAIL We must check for __GFP_NOFAIL like the page allocator does; otherwise we end up with false positives. While at it, add the printk_ratelimit() check in SLUB as well. Cc: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
26c02cf0 |
|
11-Jun-2009 |
Alexander Beregalov <a.beregalov@gmail.com> |
SLUB: fix build when !SLUB_DEBUG Fix this build error when CONFIG_SLUB_DEBUG is not set: mm/slub.c: In function 'slab_out_of_memory': mm/slub.c:1551: error: 'struct kmem_cache_node' has no member named 'nr_slabs' mm/slub.c:1552: error: 'struct kmem_cache_node' has no member named 'total_objects' [ penberg@cs.helsinki.fi: cleanups ] Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7e85ee0c |
|
12-Jun-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slab,slub: don't enable interrupts during early boot As explained by Benjamin Herrenschmidt: Oh and btw, your patch alone doesn't fix powerpc, because it's missing a whole bunch of GFP_KERNEL's in the arch code... You would have to grep the entire kernel for things that check slab_is_available() and even then you'll be missing some. For example, slab_is_available() didn't always exist, and so in the early days on powerpc, we used a mem_init_done global that is set from mem_init() (not perfect but works in practice). And we still have code using that to do the test. Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators in early boot code to avoid enabling interrupts. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
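The masking step at the end of this message can be illustrated with a small user-space sketch. The bit values and the names `X_GFP_*`, `boot_done` and `effective_gfp` are assumptions for illustration; only the pattern of clearing the three flags during early boot is taken from the message.

```c
/* Hypothetical bit values; only the masking pattern matters here. */
#define X_GFP_WAIT 0x10u
#define X_GFP_HIGH 0x20u
#define X_GFP_IO   0x40u
#define X_GFP_FS   0x80u

/* 0 during early boot, set to 1 once it is safe to sleep and do I/O. */
static int boot_done;

/* Return the mask the allocator should actually use for this request. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (!boot_done)
		gfp &= ~(X_GFP_WAIT | X_GFP_IO | X_GFP_FS);
	return gfp;
}
```

Masking inside the allocator means callers that pass a sleeping mask too early are handled in one place, instead of every call site having to test whether boot has progressed far enough.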
|
#
83b519e8 |
|
10-Jun-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator bootstrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
06f22f13 |
|
11-Jun-2009 |
Catalin Marinas <catalin.marinas@arm.com> |
kmemleak: Add the slub memory allocation/freeing hooks This patch adds the callbacks to kmemleak_(alloc|free) functions from the slub allocator. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
781b2ba6 |
|
10-Jun-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Out-of-memory diagnostics As suggested by Mel Gorman, add out-of-memory diagnostics to the SLUB allocator to make debugging OOM conditions easier. This patch helped hunt down a nasty OOM issue that popped up every now and then and was caused by SLUB debugging code which forced 4096 byte allocations to use order 1 pages even in the fallback case. An example print out looks like this: <snip page allocator out-of-memory message> SLUB: Unable to allocate memory on node -1 (gfp=20) cache: kmalloc-4096, object size: 4096, buffer size: 4168, default order: 3, min order: 1 node 0: slabs: 95, objs: 665, free: 0 Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
1eb5ac64 |
|
05-May-2009 |
Nick Piggin <npiggin@suse.de> |
mm: SLUB fix reclaim_state SLUB does not correctly account reclaim_state.reclaimed_slab, so it will break memory reclaim. Account it like SLAB does. Cc: stable@kernel.org Cc: linux-mm@kvack.org Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
818cf590 |
|
23-Apr-2009 |
David Rientjes <rientjes@google.com> |
slub: enforce MAX_ORDER slub_max_order may not be equal to or greater than MAX_ORDER. Additionally, if a single object cannot be placed in a slab of slub_max_order, it still must allocate slabs below MAX_ORDER. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
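The first constraint stated here amounts to a clamp. A minimal sketch, assuming a value of 11 for the order limit and a hypothetical helper name `clamp_slub_max_order`:

```c
/* Assumed limit for illustration; the real MAX_ORDER is configuration dependent. */
#define X_MAX_ORDER 11

/* Keep a requested maximum slab order strictly below X_MAX_ORDER. */
static int clamp_slub_max_order(int requested)
{
	if (requested > X_MAX_ORDER - 1)
		requested = X_MAX_ORDER - 1;
	return requested;
}
```

Requests below the limit pass through unchanged, while a request equal to or above the limit is reduced to the largest order the page allocator can serve.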
|
#
02af61bb |
|
10-Apr-2009 |
Zhaolei <zhaolei@cn.fujitsu.com> |
tracing, kmemtrace: Separate include/trace/kmemtrace.h to kmemtrace part and tracepoint part Impact: refactor code for future changes Current kmemtrace.h is used both as header file of kmemtrace and kmem's tracepoints definition. Tracepoints' definition file may be used by other code, and should only have definition of tracepoint. We can separate include/trace/kmemtrace.h into 2 files: include/linux/kmemtrace.h: header file for kmemtrace include/trace/kmem.h: definition of kmem tracepoints Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49DEE68A.5040902@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
2121db74 |
|
25-Mar-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
kmemtrace: trace kfree() calls with NULL or zero-length objects Impact: also output kfree(NULL) entries This patch moves the trace_kfree() calls before the ZERO_OR_NULL_PTR check so that we can trace call-sites that call kfree() with NULL many times which might be an indication of a bug. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <1237971957.30175.18.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
ca2b84cb |
|
23-Mar-2009 |
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> |
kmemtrace: use tracepoints kmemtrace now uses tracepoints instead of markers. We no longer need to use format specifiers to pass arguments. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ] [ folded: fix build when CONFIG_KMEMTRACE is disabled. ] [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ] Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
1a00df4a |
|
06-Mar-2009 |
Akinobu Mita <akinobu.mita@gmail.com> |
slub: use get_track() Use get_track() in set_track() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
c0bdb232 |
|
25-Feb-2009 |
David Rientjes <rientjes@google.com> |
slub: rename calculate_min_partial() to set_min_partial() As suggested by Christoph Lameter, rename calculate_min_partial() to set_min_partial() as the function doesn't really do any calculations. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
73d342b1 |
|
22-Feb-2009 |
David Rientjes <rientjes@google.com> |
slub: add min_partial sysfs tunable Now that a cache's min_partial has been moved to struct kmem_cache, it's possible to easily tune it from userspace by adding a sysfs attribute. It may not be desirable to keep a large number of partial slabs around if a cache is used infrequently and memory, especially when constrained by a cgroup, is scarce. It's better to allow userspace to set the minimum policy per cache instead of relying explicitly on kmem_cache_shrink(). The memory savings from simply moving min_partial from struct kmem_cache_node to struct kmem_cache is obviously not significant (unless maybe you're from SGI or something), at the largest it's # allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long) The true savings occurs when userspace reduces the number of partial slabs that would otherwise be wasted, especially on machines with a large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?). As well as the kernel estimates ideal values for n->min_partial and ensures it's within a sane range, userspace has no other input other than writing to /sys/kernel/slab/cache/shrink. There simply isn't any better heuristic to add when calculating the partial values for a better estimate that works for all possible caches. And since it's currently a static value, the user really has no way of reclaiming that wasted space, which can be significant when constrained by a cgroup (either cpusets or, later, memory controller slab limits) without shrinking it entirely. This also allows the user to specify that increased fragmentation and more partial slabs are actually desired to avoid the cost of allocating new slabs at runtime for specific caches. There's also no reason why this should be a per-struct kmem_cache_node value in the first place. 
You could argue that a machine would have such node size asymmetries that it should be specified on a per-node basis, but we know nobody is doing that right now since it's a purely static value at the moment and there's no convenient way to tune that via slub's sysfs interface. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
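The upper bound on the savings quoted in this message, `# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)`, can be evaluated directly. The helper name `min_partial_savings` and the example figures below are assumptions chosen for illustration.

```c
#include <stddef.h>

/*
 * Bytes saved by keeping one min_partial per cache instead of one per
 * node: every cache drops (max_numnodes - 1) copies of an unsigned long.
 * Assumes max_numnodes >= 1.
 */
static size_t min_partial_savings(size_t nr_caches, size_t max_numnodes)
{
	return nr_caches * (max_numnodes - 1) * sizeof(unsigned long);
}
```

As a worked case, with 100 caches, a node shift of 10 (1024 possible nodes) and 8-byte longs, the bound is 100 * 1023 * 8 = 818,400 bytes, which supports the message's remark that the direct saving is small next to what tuning the partial list can reclaim.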
|
#
3b89d7d8 |
|
22-Feb-2009 |
David Rientjes <rientjes@google.com> |
slub: move min_partial to struct kmem_cache Although it allows for better cacheline use, it is unnecessary to save a copy of the cache's min_partial value in each kmem_cache_node. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
fe1200b6 |
|
16-Feb-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constants As a preparational patch to bump up page allocator pass-through threshold, introduce two new constants SLUB_MAX_SIZE and SLUB_PAGE_SHIFT and convert mm/slub.c to use them. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
e8120ff1 |
|
12-Feb-2009 |
Zhang Yanmin <yanmin.zhang@linux.intel.com> |
SLUB: Fix default slab order for big object sizes The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. slab_size order name ------------------------------------------------- 4096 3 sgpool-128 8192 2 kmalloc-8192 16384 3 kmalloc-16384 kmalloc-8192's default order is smaller than sgpool-128's. On 4*4 tigerton machine, a similar issue appears on another kmem_cache. Function calculate_order uses 'min_objects /= 2;' to shrink. Combined with the size calculation/checking in slab_order, this sometimes causes the above issue. Below patch against 2.6.29-rc2 fixes it. I checked the default orders of all kmem_cache and they don't become smaller than before. So the patch wouldn't hurt performance. Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
ffadd4d0 |
|
16-Feb-2009 |
Christoph Lameter <cl@linux-foundation.org> |
SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constants As a preparational patch to bump up page allocator pass-through threshold, introduce two new constants SLUB_MAX_SIZE and SLUB_PAGE_SHIFT and convert mm/slub.c to use them. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
cf40bd16 |
|
21-Jan-2009 |
Nick Piggin <npiggin@suse.de> |
lockdep: annotate reclaim context (__GFP_NOFS) Here is another version, with the incremental patch rolled up, and added reclaim context annotation to kswapd, and allocation tracing to slab allocators (which may only ever reach the page allocator in rare cases, so it is good to put annotations here too). Haven't tested this version as such, but it should be getting closer to merge worthy ;) -- After noticing some code in mm/filemap.c accidentally perform a __GFP_FS allocation when it should not have been, I thought it might be a good idea to try to catch this kind of thing with lockdep. I coded up a little idea that seems to work. Unfortunately the system has to actually be in __GFP_FS page reclaim, then take the lock, before it will mark it. But at least that might still be some orders of magnitude more common (and more debuggable) than an actual deadlock condition, so we have some improvement I hope (the concept is no less complete than discovery of a lock's interrupt contexts). I guess we could even do the same thing with __GFP_IO (normal reclaim), and even GFP_NOIO locks too... but filesystems will have the most locks and fiddly code paths, so let's start there and see how it goes. It *seems* to work. I did a quick test. ================================= [ INFO: inconsistent lock state ] 2.6.28-rc6-00007-ged31348-dirty #26 --------------------------------- inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage. 
modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] {in-reclaim-W} state was registered at: [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60 [<ffffffff80268f71>] lock_acquire+0x91/0xc0 [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310 [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 3929 hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310 hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310 softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0 softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0 other info that might help us debug this: 1 lock held by modprobe/8526: #0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] stack backtrace: Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26 Call Trace: [<ffffffff80265483>] print_usage_bug+0x193/0x1d0 [<ffffffff80266530>] mark_lock+0xaf0/0xca0 [<ffffffff80266735>] mark_held_locks+0x55/0xc0 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd] [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180 [<ffffffff802669ec>] ? 
trace_hardirqs_on_caller+0x12c/0x190 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
b1aabecd |
|
10-Feb-2009 |
Kirill A. Shutemov <kirill@shutemov.name> |
mm: Export symbol ksize() Commit 7b2cd92adc5430b0c1adeb120971852b4ea1ab08 ("crypto: api - Fix zeroing on free") added modular user of ksize(). Export that to fix crypto.ko compilation. Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
37189094 |
|
27-Jan-2009 |
David Rientjes <rientjes@google.com> |
slub: fix per cpu kmem_cache_cpu array memory leak The per cpu array of kmem_cache_cpu structures accommodates NR_KMEM_CACHE_CPU such structs. When this array overflows and a struct is allocated by kmalloc(), it may have an address at the upper bound of this array. If this happens, it does not get freed and the per cpu kmem_cache_cpu_free pointer will be out of bounds after kmem_cache_destroy() or cpu offlining. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
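The boundary condition behind this leak is an off-by-one in a pointer range check. The sketch below is a user-space model, assuming a static pool in place of the per cpu array; `slot_pool`, `NR_SLOTS` and both helpers are hypothetical names, and the exact form of the original check is an assumption.

```c
#include <stddef.h>

#define NR_SLOTS 100 /* stand-in for NR_KMEM_CACHE_CPU; the value is an assumption */

struct slot {
	void *freelist;
};

static struct slot slot_pool[NR_SLOTS];

/* Buggy form: an address one past the end is treated as inside the pool. */
static int in_pool_buggy(const struct slot *c)
{
	return !(c < slot_pool || c > slot_pool + NR_SLOTS);
}

/* Fixed form: the valid range is [slot_pool, slot_pool + NR_SLOTS). */
static int in_pool(const struct slot *c)
{
	return c >= slot_pool && c < slot_pool + NR_SLOTS;
}
```

An object that was really obtained from kmalloc() but happens to sit at `slot_pool + NR_SLOTS` passes the buggy check, so it would be returned to the pool's free list instead of being freed, which matches the leak and the out-of-bounds free pointer described in the message.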
|
#
6047a007 |
|
13-Jan-2009 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Use ->objsize from struct kmem_cache_cpu in slab_free() There's no reason to use ->objsize from struct kmem_cache in slab_free() for the SLAB_DEBUG_OBJECTS case. All it does is generate extra cache pressure as we try very hard not to touch struct kmem_cache in the fast-path. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
0211a9c8 |
|
29-Dec-2008 |
Frederik Schwarzer <schwarzerf@gmail.com> |
trivial: fix an -> a typos in documentation and comments It is always "an" if there is a vowel _spoken_ (not written). So it is: "an hour" (spoken vowel) but "a uniform" (spoken 'j') Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
174596a0 |
|
31-Dec-2008 |
Rusty Russell <rusty@rustcorp.com.au> |
cpumask: convert mm/ Impact: Use new API Convert kernel mm functions to use struct cpumask. We skip include/linux/percpu.h and mm/allocpercpu.c, which are in flux. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
|
#
36994e58 |
|
29-Dec-2008 |
Frederic Weisbecker <fweisbec@gmail.com> |
tracing/kmemtrace: normalize the raw tracer event to the unified tracing API Impact: new tracer plugin This patch adapts kmemtrace raw events tracing to the unified tracing API. To enable and use this tracer, just do the following: echo kmemtrace > /debugfs/tracing/current_tracer cat /debugfs/tracing/trace You will have the following output: # tracer: kmemtrace # # # ALLOC TYPE REQ GIVEN FLAGS POINTER NODE CALLER # FREE | | | | | | | | # | type_id 1 call_site 18446744071565527833 ptr 18446612134395152256 type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1 type_id 1 call_site 18446744071565585534 ptr 18446612134405955584 type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1 type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1 type_id 1 call_site 18446744071565585534 ptr 18446612134405955584 type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1 type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1 type_id 1 call_site 18446744071565585534 ptr 18446612134405955584 type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1 type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1 type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1 type_id 1 call_site 18446744071565585534 ptr 18446612134405955584 type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1 type_id 1 call_site 18446744071565585534 ptr 18446612134405955584 That was to stay backward compatible with the format output produced in 
linux/tracepoint.h. This is the default output, but note that I tried something else. If you change an option: echo kmem_minimalistic > /debugfs/trace_options and then cat /debugfs/trace, you will have the following output: # tracer: kmemtrace # # # ALLOC TYPE REQ GIVEN FLAGS POINTER NODE CALLER # FREE | | | | | | | | # | - C 0xffff88007c088780 file_free_rcu + K 4096 4096 000000d0 0xffff88007cad6000 -1 getname - C 0xffff88007cad6000 putname + K 4096 4096 000000d0 0xffff88007cad6000 -1 getname + K 240 240 000000d0 0xffff8800790dc780 -1 d_alloc - C 0xffff88007cad6000 putname + K 4096 4096 000000d0 0xffff88007cad6000 -1 getname + K 240 240 000000d0 0xffff8800790dc870 -1 d_alloc - C 0xffff88007cad6000 putname + K 4096 4096 000000d0 0xffff88007cad6000 -1 getname + K 240 240 000000d0 0xffff8800790dc960 -1 d_alloc + K 1304 1312 000000d0 0xffff8800791d7340 -1 reiserfs_alloc_inode - C 0xffff88007cad6000 putname + K 4096 4096 000000d0 0xffff88007cad6000 -1 getname - C 0xffff88007cad6000 putname + K 992 1000 000000d0 0xffff880079045b58 -1 alloc_inode + K 768 1024 000080d0 0xffff88007c096400 -1 alloc_pipe_info + K 240 240 000000d0 0xffff8800790dca50 -1 d_alloc + K 272 320 000080d0 0xffff88007c088780 -1 get_empty_filp + K 272 320 000080d0 0xffff88007c088000 -1 get_empty_filp Yeah I shall confess kmem_minimalistic should be: kmem_alternative. Whatever, I find it more readable but this is a personal opinion of course. We can drop it if you want. On the ALLOC/FREE column, + means an allocation and - a free. On the type column, you have K = kmalloc, C = cache, P = page I would like the flags to be GFP_* strings but that would not be easy to not break the column with strings.... About the node...it seems to always be -1. I don't know why but that shouldn't be difficult to find. I moved linux/tracepoint.h to trace/tracepoint.h as well. I think it would be easier to find the tracer headers if they are all in their common directory. 
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
2a38b1c4 |
|
29-Dec-2008 |
Ingo Molnar <mingo@elte.hu> |
kmemtrace: move #include lines Impact: avoid conflicts with kmemcheck kmemcheck modifies the same area of slab.c and slub.c - move the include lines up a bit. Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
2e67624c |
|
01-Sep-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
kmemtrace: remove unnecessary casts Now that we use _RET_IP_ there's no need to cast 'caller' to unsigned long. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
94b528d0 |
|
24-Aug-2008 |
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> |
kmemtrace: SLUB hooks for caller-tracking functions. This patch adds kmemtrace hooks for __kmalloc_track_caller() and __kmalloc_node_track_caller(). Currently, they set the call site pointer to the value received as a parameter. (This could change if we implement stack trace exporting in kmemtrace.) Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
5b882be4 |
|
19-Aug-2008 |
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> |
kmemtrace: SLUB hooks. This adds hooks for the SLUB allocator, to allow tracing with kmemtrace. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
35995a4d |
|
19-Aug-2008 |
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> |
SLUB: Replace __builtin_return_address(0) with _RET_IP_. This patch replaces __builtin_return_address(0) with _RET_IP_, since a previous patch moved _RET_IP_ and _THIS_IP_ to include/linux/kernel.h and they're widely available now. This makes for shorter and easier to read code. [penberg@cs.helsinki.fi: remove _RET_IP_ casts to void pointer] Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7b8f3b66 |
|
17-Dec-2008 |
David Rientjes <rientjes@google.com> |
slub: avoid leaking caches or refcounts on sysfs error If a slab cache is mergeable and the sysfs alias cannot be added, the target cache shall have its refcount decremented. kmem_cache_create() will return NULL, so if kmem_cache_destroy() is ever called on the target cache, it will never be freed if the refcount has been leaked. Likewise, if a slab cache is not mergeable and the sysfs link cannot be added, the new cache shall be removed from the slab_caches list. kmem_cache_create() will return NULL, so it will be impossible to call kmem_cache_destroy() on it. Both of these operations require slub_lock since refcount of all slab caches and slab_caches are protected by the lock. In the mergeable case, it would be better to restore objsize and offset back to their original values, but this could race with another merge since slub_lock was dropped. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
89124d70 |
|
19-Nov-2008 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
slub: Add might_sleep_if() to slab_alloc() Currently SLUB doesn't warn about __GFP_WAIT. Add it into slab_alloc(). Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
773ff60e |
|
23-Dec-2008 |
Akinobu Mita <akinobu.mita@gmail.com> |
SLUB: failslab support Currently fault-injection capability for SLAB allocator is only available to SLAB. This patch makes it available to SLUB, too. [penberg@cs.helsinki.fi: unify slab and slub implementations] Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
29c0177e |
|
13-Dec-2008 |
Rusty Russell <rusty@rustcorp.com.au> |
cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers. Impact: change calling convention of existing cpumask APIs Most cpumask functions started with cpus_: these have been replaced by cpumask_ ones which take struct cpumask pointers as expected. These four functions don't have good replacement names; fortunately they're rarely used, so we just change them over. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: paulus@samba.org Cc: mingo@redhat.com Cc: tony.luck@intel.com Cc: ralf@linux-mips.org Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: cl@linux-foundation.org Cc: srostedt@redhat.com
|
#
9c246247 |
|
09-Dec-2008 |
Hugh Dickins <hugh@veritas.com> |
KSYM_SYMBOL_LEN fixes Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 sprint_symbol(): use less stack exposing a bug in slub's list_locations() - kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of page provided. The 100 slop which list_locations() allows at end of page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before. Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them. [akpm@linux-foundation.org: ftrace.h needs module.h] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Miles Lane <miles.lane@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f6c708e |
|
04-Dec-2008 |
Nick Andrew <nick@nick-andrew.net> |
slub: Fix incorrect use of loose It should be 'lose', not 'loose'. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
dc19f9db |
|
01-Dec-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: memory hotplug fix for notifier callback Fixes for memcg/memory hotplug. While memory hotplug allocates/frees memmap, page_cgroup doesn't free page_cgroup at OFFLINE when page_cgroup is allocated via bootmem. (Because freeing bootmem requires special care.) Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated by memory hotplug, page_cgroup->page == page is no longer true. But the current MEM_ONLINE handler doesn't check this, and doesn't update page_cgroup->page when page_cgroup does not need to be allocated. (This was not found because memmap is not freed if SPARSEMEM_VMEMMAP is y.) And I noticed that MEM_ONLINE can be called against "part of section". So, freeing page_cgroup at CANCEL_ONLINE will cause trouble (freeing a used page_cgroup). Don't roll back at CANCEL. One more: the current memory hotplug notifier chain is stopped by slub because it sets NOTIFY_STOP_MASK in its return value. So, page_cgroup's callback is never called. (It has lower priority than slub now.) I think this slub behavior is not intentional (a BUG), so this patch fixes it. Another way to be considered about page_cgroup allocation: - free page_cgroup at OFFLINE even if it's from bootmem and remove the special handler. But it requires more changes. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0094de92 |
|
25-Nov-2008 |
David Rientjes <rientjes@google.com> |
slub: make early_kmem_cache_node_alloc void The return value for early_kmem_cache_node_alloc() is unused, so it is better defined as void. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
e9beef18 |
|
28-Oct-2008 |
Cyrill Gorcunov <gorcunov@gmail.com> |
slub - fix get_object_page comment Use 'slab page' instead of 'slab object'. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
ce71e27c |
|
19-Aug-2008 |
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> |
SLUB: Replace __builtin_return_address(0) with _RET_IP_. This patch replaces __builtin_return_address(0) with _RET_IP_, since a previous patch moved _RET_IP_ and _THIS_IP_ to include/linux/kernel.h and they're widely available now. This makes for shorter and easier to read code. [penberg@cs.helsinki.fi: remove _RET_IP_ casts to void pointer] Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
210b5c06 |
|
22-Oct-2008 |
Cyrill Gorcunov <gorcunov@gmail.com> |
SLUB: cleanup - define macros instead of hardcoded numbers Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7b3c3a50 |
|
05-Oct-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c Lose dummy ->write hook in case of SLUB, it's possible now. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
02b71b70 |
|
11-Sep-2008 |
Salman Qazi <sqazi@google.com> |
slub: fixed uninitialized counter in struct kmem_cache_node Initialized total objects atomic for the node in init_kmem_cache_node. The uninitialized value was ruining the stats in /proc/slabinfo. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
e2cb96b7 |
|
19-Aug-2008 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Disable NUMA remote node defragmentation by default Switch remote node defragmentation off by default. The current settings can cause excessive node local allocations with hackbench:

SLAB:
% cat /proc/meminfo
MemTotal: 7701760 kB
MemFree: 5940096 kB
Slab: 123840 kB

SLUB:
% cat /proc/meminfo
MemTotal: 7701376 kB
MemFree: 4740928 kB
Slab: 1591680 kB

[Note: this feature is not related to slab defragmentation.] You can find the original discussion here: http://lkml.org/lkml/2008/8/4/308 Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
5595cffc |
|
05-Aug-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: dynamic per-cache MIN_PARTIAL This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial value that is calculated from object size. The bigger the object size, the more pages we keep on the partial list. I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example script of the fio benchmarking tool. The script stresses the networking subsystem which should also give a fairly good beating of kmalloc() et al. To run the test yourself, first clone the fio repository:

git clone git://git.kernel.dk/fio.git

and then run the following command n times on your machine:

time ./fio examples/netio

The results on my 2-way 64-bit x86 machine are as follows:

[ the minimum, maximum, and average are captured from 50 individual runs ]

real time (seconds)
                 min    max    avg    sd
SLAB             22.76  23.38  22.98  0.17
SLUB             22.80  25.78  23.46  0.72
SLUB (dynamic)   22.74  23.54  23.00  0.20

sys time (seconds)
                 min    max    avg    sd
SLAB             6.90   8.28   7.70   0.28
SLUB             7.42   16.95  8.89   2.28
SLUB (dynamic)   7.17   8.64   7.73   0.29

user time (seconds)
                 min    max    avg    sd
SLAB             36.89  38.11  37.50  0.29
SLUB             30.85  37.99  37.06  1.67
SLUB (dynamic)   36.75  38.07  37.59  0.32

As you can see from the above numbers, this patch brings SLUB to the same level as SLAB for this particular workload fixing a ~2% regression. I'd expect this change to help similar workloads that allocate a lot of objects that are close to the size of a page. Cc: Matthew Wilcox <matthew@wil.cx> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
231367fd |
|
22-Jul-2008 |
Adrian Bunk <bunk@kernel.org> |
mm: unexport ksize This patch removes the obsolete and no longer used exports of ksize. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
51cc5068 |
|
25-Jul-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexers. Nobody uses this "feature", nor does anybody use the passed kmem cache in a non-trivial way, so pass only a pointer to the object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8a38082d |
|
23-Jul-2008 |
Andy Whitcroft <apw@shadowen.org> |
slub: record page flag overlays explicitly SLUB reuses two page bits for internal purposes: it overlays PG_active and PG_error. This is hidden away in slub.c. Document these overlays explicitly in the main page-flags enum along with all the others. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0ebd652b |
|
19-Jul-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: dump more data on slab corruption The limit of 128 bytes is too small when debugging slab corruption of the skb cache, for example. So increase the limit to PAGE_SIZE to make debugging corruptions easier. Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
41ab8592 |
|
16-Jul-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
SLUB: simplify re on_each_cpu() on_each_cpu() expands to function call on UP, too. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
88e4ccf2 |
|
22-Jun-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
slub: current is always valid Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
0937502a |
|
28-May-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Add check for kfree() of non slab objects. We can detect kfree()s on non slab objects by checking for PageCompound(). Works in the same way as for ksize. This helped me catch an invalid kfree(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
7daf705f |
|
14-Jul-2008 |
Linus Torvalds <torvalds@linux-foundation.org> |
Start using the new '%pS' infrastructure to print symbols This simplifies the code significantly, and was the whole point of the exercise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bdb21928 |
|
10-Jul-2008 |
Dmitry Adamushko <dmitry.adamushko@gmail.com> |
slub: Fix use-after-preempt of per-CPU data structure Vegard Nossum reported a crash in kmem_cache_alloc():

BUG: unable to handle kernel paging request at da87d000
IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0
*pde = 28180163 *pte = 1a87d160
Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0
EIP is at kmem_cache_alloc+0xc7/0xe0
EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

and analyzed it:

"The register %ecx looks innocent but is very important here. The disassembly:

mov %edx,%ecx
shr $0x2,%ecx
rep stos %eax,%es:(%edi) <-- the fault

So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE. (0x6b6b6b6b >> 2 == 0x1adadada.) %ecx is the counter for the memset, from here:

memset(object, 0, c->objsize);

i.e. %ecx was loaded from c->objsize, so "c" must have been freed. Where did "c" come from? Uh-oh...

c = get_cpu_slab(s, smp_processor_id());

This looks like it has very much to do with CPU hotplug/unplug. Is there a race between SLUB/hotplug since the CPU slab is used after it has been freed?"

Good analysis. Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc() can be migrated to another CPU right after local_irq_restore() and before memset(). The initial CPU can go offline in the meantime (or the migration is a consequence of the CPU going offline), so its 'kmem_cache_cpu' structure gets freed (slab_cpuup_callback). At some point the caller continues on another CPU with an obsolete pointer... Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
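The register arithmetic in the analysis above can be checked with a small sketch. The helper names are hypothetical, and reading EBX as the start of the object being cleared is an inference from the dump, not something the report states.

```c
#include <assert.h>
#include <stdint.h>

/* Word-sized pattern of the free-poison byte 0x6b, as seen in EDX. */
#define POISON_FREE_WORD 0x6b6b6b6bu

/* "shr $0x2,%ecx": the byte count becomes a count of 4-byte stores. */
uint32_t stos_count(uint32_t objsize)
{
    return objsize >> 2;
}

/* Bytes already written by "rep stos" when it stopped with a count left over. */
uint32_t bytes_written(uint32_t initial_count, uint32_t remaining_count)
{
    assert(initial_count >= remaining_count);
    return (initial_count - remaining_count) * 4u;
}
```

With the values from the oops, stos_count(POISON_FREE_WORD) gives 0x1adadada, and bytes_written(0x1adadada, 0x1adad71a) gives 0xf00, which matches the distance from EBX (da87c100) to the faulting EDI (da87d000).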
|
#
cde53535 |
|
04-Jul-2008 |
Christoph Lameter <clameter@sgi.com> |
Christoph has moved Remove all clameter@sgi.com addresses from the kernel tree since they will become invalid on June 27th. Change my maintainer email address for the slab allocators to cl@linux-foundation.org (which will be the new email address for the future). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41d54d3b |
|
03-Jul-2008 |
Christoph Lameter <cl@linux-foundation.org> |
slub: Do not use 192 byte sized cache if minimum alignment is 128 byte The 192 byte cache is not necessary if we have a basic alignment of 128 byte. If it would be used then the 192 would be aligned to the next 128 byte boundary which would result in another 256 byte cache. Two 256 kmalloc caches cause sysfs to complain about a duplicate entry. MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot without this patch. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
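The rounding that produces the duplicate cache can be sketched as follows. align_up() is a hypothetical stand-in for the kernel's alignment helper, assuming a power-of-two alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

With a 128 byte minimum alignment, align_up(192, 128) yields 256, the same size as the existing 256 byte kmalloc cache, whereas with an 8 byte alignment the 192 byte cache keeps its own size.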
|
#
15c8b6c1 |
|
09-May-2008 |
Jens Axboe <jens.axboe@oracle.com> |
on_each_cpu(): kill unused 'retry' parameter It's not even passed on to smp_call_function() anymore, since that was removed. So kill it. Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
76994412 |
|
22-May-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: ksize() abuse checks Add a WARN_ON for pages that don't have PageSlab nor PageCompound set to catch the worst abusers of ksize() in the kernel. Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
4ea33e2d |
|
06-May-2008 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
slub: fix atomic usage in any_slab_objects() any_slab_objects() does an atomic_read on an atomic_long_t, this fixes it to use atomic_long_read instead. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f6acb635 |
|
29-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: #ifdef simplification If we make SLUB_DEBUG depend on SYSFS then we can simplify some #ifdefs and avoid others. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
0121c619 |
|
29-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Whitespace cleanup and use of strict_strtoul Fix some issues with wrapping and use strict_strtoul to make parameter passing from sysfs safer. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
f8bd2258 |
|
01-May-2008 |
Roman Zippel <zippel@linux-m68k.org> |
remove div_long_long_rem x86 is the only arch right now which provides an optimized version of div_long_long_rem, and it has the downside that one has to be very careful that the divide doesn't overflow. The API is a little awkward, as the arguments for the unsigned divide are signed. The signed version also doesn't handle a negative divisor and produces worse code on 64bit archs. There is little incentive to keep this API alive, so this converts the few users to the new API. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3ac7fe5a |
|
30-Apr-2008 |
Thomas Gleixner <tglx@linutronix.de> |
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot is kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose them, usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem, a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. 
The tracked object operations are:

- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed, and the subsystem specific code can provide a fixup function which allows the damage of the operation to be prevented. When the sanity check triggers, a warning message and a stack trace are printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has its own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0c40ba4f |
|
29-Apr-2008 |
Nadia Derbey <Nadia.Derbey@bull.net> |
ipc: define the slab_memory_callback priority as a constant This is a trivial patch that defines the priority of slab_memory_callback in the callback chain as a constant. This is to prepare for next patch in the series. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b27d05b |
|
28-Apr-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
mm: move cache_line_size() to <linux/cache.h> Not all architectures define cache_line_size() so as suggested by Andrew move the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dd1a239f |
|
28-Apr-2008 |
Mel Gorman <mel@csn.ul.ie> |
mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a subtraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
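A minimal sketch of the idea, assuming a simplified layout: the type and function names below are hypothetical stand-ins, not the kernel's definitions. Caching the index next to the pointer lets a filter loop compare integers without touching the zone itself.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct zone; only what the sketch needs. */
struct zone_sketch {
    int node;
};

/* A zonelist entry that caches the zone index next to the zone pointer. */
struct zoneref_sketch {
    struct zone_sketch *zone;
    int zone_idx;
};

/*
 * Return the position of the first entry at or after 'start' whose cached
 * zone_idx does not exceed 'highest_zoneidx', or 'n' if there is none.
 */
size_t next_allowed_zone(const struct zoneref_sketch *refs, size_t n,
                         size_t start, int highest_zoneidx)
{
    assert(refs != NULL);
    for (size_t i = start; i < n; i++) {
        if (refs[i].zone_idx <= highest_zoneidx)
            return i;
    }
    return n;
}
```

The loop never dereferences the zone pointer, which is the saving the commit message describes.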
|
#
54a6eb5c |
|
28-Apr-2008 |
Mel Gorman <mel@csn.ul.ie> |
mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that iterates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0e88460d |
|
28-Apr-2008 |
Mel Gorman <mel@csn.ul.ie> |
mm: introduce node_zonelist() for accessing the zonelist for a GFP mask Introduce a node_zonelist() helper function. It is used to lookup the appropriate zonelist given a node and a GFP mask. The patch on its own is a cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If necessary, it can be merged with the next patch in this set without problems. Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c124f5b5 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: pack objects denser Since we now have more orders available use a denser packing. Increase slab order if more than 1/16th of a slab would be wasted. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
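The 1/16th rule can be sketched as a search for the smallest order whose leftover space is small enough. This is an illustration only: the 4096 byte page size, the function name, and the parameter layout are assumptions, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch; real systems may differ. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Return the smallest order in [min_order, max_order] for which the space
 * left over after packing whole objects is at most 1/fract of the slab,
 * or -1 if no order in the range qualifies.
 */
int pick_order(size_t size, int min_order, int max_order, unsigned int fract)
{
    assert(size > 0 && fract > 0);
    for (int order = min_order; order <= max_order; order++) {
        size_t slab_size = (size_t)SKETCH_PAGE_SIZE << order;
        if (slab_size < size)
            continue;               /* not even one object fits */
        size_t waste = slab_size % size;
        if (waste <= slab_size / fract)
            return order;
    }
    return -1;
}
```

For a 700 byte object, an order 0 slab wastes 596 bytes, more than 4096/16 = 256, so pick_order(700, 0, 3, 16) moves on to order 1, where the 492 bytes left over fit under 8192/16 = 512.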
|
#
9b2cd506 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Calculate min_objects based on number of processors. The minimum objects per slab is calculated based on the number of processors that may come online.

Processors  min_objects
---------------------------
1           8
2           12
4           16
8           20
16          24
32          28
64          32
1024        48
4096        56

The higher the number of processors, the larger the order sizes used for various slab caches will become. This has been shown to address the performance issues in hackbench on 16p etc. The calculation is only performed if slub_min_objects is zero (default). If one specifies a slub_min_objects on boot then that setting is taken. As suggested by Zhang Yanmin's performance tests on 16-core Tigerton, use the formula '4 * (fls(nr_cpu_ids) + 1)':

./hackbench 100 process 2000:
1) 2.6.25-rc6slab: 23.5 seconds
2) 2.6.25-rc7SLUB+slub_min_objects=20: 31 seconds
3) 2.6.25-rc7SLUB+slub_min_objects=24: 23.5 seconds

Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
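The formula quoted above reproduces the table. A minimal sketch, with fls_sketch() as a hypothetical userspace stand-in for the kernel's fls() (1-based position of the highest set bit, 0 for an input of 0):

```c
#include <assert.h>

/* 1-based position of the most significant set bit; 0 when x is 0. */
int fls_sketch(unsigned int x)
{
    int bit = 0;
    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Default min_objects for a given number of possible CPUs. */
int default_min_objects(unsigned int nr_cpu_ids)
{
    assert(nr_cpu_ids > 0);
    return 4 * (fls_sketch(nr_cpu_ids) + 1);
}
```

For 16 processors, fls gives 5, so the default becomes 4 * 6 = 24, which is the slub_min_objects value that matched SLAB in the hackbench numbers.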
|
#
114e9e89 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS We can now fall back to order 0 slabs. So set the slub_max_order to PAGE_CACHE_ORDER_COSTLY but keep the slub_min_objects at 4. This will mostly preserve the orders used in 2.6.25. For example, the 2k kmalloc slab will use order 1 allocs and the 4k kmalloc slab order 2. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
31d33baf |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Simplify any_slab_object checks Since we now have total_objects counter per node use that to check for the presence of any objects. The loop over all cpu slabs is not that useful since any cpu slab would require an object allocation first. So drop that. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
06b285dc |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Make the order configurable for each slab cache Makes /sys/kernel/slab/<slabname>/order writable. The allocation order of a slab cache can then be changed dynamically during runtime. This can be used to override the objects per slab value established with the slub_min_objects setting that was manually specified or calculated on bootup. The changes of the slab order can occur while allocate_slab() runs. allocate_slab() needs the order and the number of slab objects, both of which are changed by a change of order. Both are put into a single word (struct kmem_cache_order_objects). They can then be atomically updated and retrieved. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
319d1e24 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Drop fallback to page allocator method There is now a generic method of falling back to a slab page of minimal order. No need anymore for the fallback to kmalloc_large(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
65c3376a |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Fallback to minimal order during slab page allocation If any higher order allocation fails then fall back to the smallest order necessary to contain at least one object. This enables fallback for all allocations to order 0 pages. The fallback will waste more memory (objects will not fit neatly) and the fallback slabs will not be as efficient as larger slabs since they contain fewer objects. Note that SLAB also depends on order 1 allocations for some slabs that waste too much memory if forced into a PAGE_SIZE'd page. SLUB can now deal with failing order 1 allocs which SLAB cannot do. Add a new field min that will contain the objects for the smallest possible order for a slab cache. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
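A rough sketch of the fallback idea, under stated assumptions: `sketch_alloc_pages` is a toy stand-in for the page allocator that only succeeds up to a given order, and all names and the 4 KiB page size are hypothetical. The real allocate_slab() differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Toy page allocator: requests above max_order_available fail. */
struct sketch_allocator {
    int max_order_available;
};

struct sketch_cache {
    int preferred_order; /* order normally used for new slabs */
    int min_order;       /* smallest order that holds one object */
};

static bool sketch_alloc_pages(const struct sketch_allocator *a, int order)
{
    return order <= a->max_order_available;
}

/* Smallest order whose slab can contain at least one object of obj_size bytes. */
static int sketch_min_order(size_t obj_size)
{
    int order = 0;

    assert(obj_size > 0);
    while ((SKETCH_PAGE_SIZE << order) < obj_size)
        order++;
    return order;
}

/* Try the preferred order first, then fall back to the minimal order.
 * Returns the order actually used, or -1 if both attempts fail. */
static int sketch_allocate_slab(const struct sketch_allocator *a,
                                const struct sketch_cache *c)
{
    assert(c->min_order <= c->preferred_order);

    if (sketch_alloc_pages(a, c->preferred_order))
        return c->preferred_order;
    if (sketch_alloc_pages(a, c->min_order))
        return c->min_order;
    return -1;
}
```

On a fragmented system where only order 0 pages are available, a cache that prefers order 3 still gets an order 0 slab, at the cost of fewer objects per slab.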
|
#
205ab99d |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Update statistics handling for variable order slabs Change the statistics to consider that slabs of the same slabcache can have different number of objects in them since they may be of different order. Provide a new sysfs field total_objects which shows the total objects that the allocated slabs of a slabcache could hold. Add a max field that holds the largest slab order that was ever used for a slab cache. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
834f3d11 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Add kmem_cache_order_objects struct Pack the order and the number of objects into a single word. This saves some memory in the kmem_cache_structure and more importantly allows us to fetch both values atomically. Later the slab orders become runtime configurable and we need to fetch these two items together in order to properly allocate a slab and initialize its objects. Fix the race by fetching the order and the number of objects in one word. [penberg@cs.helsinki.fi: fix memset() page order in new_slab()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
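Packing both values into one word might look like the sketch below. The 16 bit split and all `sketch_` names are assumptions for illustration. The only fact taken from this history is that the object count is kept within an unsigned 16 bit range.

```c
#include <assert.h>

#define SKETCH_OO_SHIFT 16 /* assumption: low 16 bits hold the object count */
#define SKETCH_OO_MASK  ((1UL << SKETCH_OO_SHIFT) - 1)

/* One word holding both the slab order and the objects per slab. */
struct sketch_order_objects {
    unsigned long x;
};

static struct sketch_order_objects sketch_oo_make(unsigned int order,
                                                  unsigned int objects)
{
    struct sketch_order_objects oo;

    assert(objects <= SKETCH_OO_MASK);
    oo.x = ((unsigned long)order << SKETCH_OO_SHIFT) | objects;
    return oo;
}

static unsigned int sketch_oo_order(struct sketch_order_objects oo)
{
    return (unsigned int)(oo.x >> SKETCH_OO_SHIFT);
}

static unsigned int sketch_oo_objects(struct sketch_order_objects oo)
{
    return (unsigned int)(oo.x & SKETCH_OO_MASK);
}
```

Because a reader copies the whole word in a single load, it sees an order and an object count that belong together, even if another context writes a new pair concurrently.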
|
#
224a88be |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: for_each_object must be passed the number of objects in a slab Pass the number of objects to the for_each_object macro. Most of these are debug related. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
39b26464 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Store max number of objects in the page struct. Split the inuse field up to be able to store the number of objects in this page in the page struct as well. Necessary if we want to have pages of various orders for a slab. Also avoids touching struct kmem_cache cachelines in __slab_alloc(). Update diagnostic code to check the number of objects and make sure that the number of objects always stays within the bounds of a 16 bit unsigned integer. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
33b12c38 |
|
25-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Dump list of objects not freed on kmem_cache_close() Dump a list of unfreed objects if a slab cache is closed but objects still remain. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
599870b1 |
|
23-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: free_list() cleanup free_list looked a bit screwy so here is an attempt to clean it up. free_list is only used for freeing partial lists. We do not need to return a parameter if we decrement nr_partial within the function which allows a simplification of the whole thing. The current version modifies nr_partial outside of the list_lock which is technically not correct. It was only ok because we should be the only user of this slab cache at this point. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
d629d819 |
|
23-Apr-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: improve kmem_cache_destroy() error message As pointed out by Ingo, the SLUB warning of calling kmem_cache_destroy() with a cache that still has objects triggers in practice. So turn this WARN_ON() into a nice SLUB specific error message to avoid people confusing it with a SLUB bug. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
3dc50637 |
|
23-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slab_err: Pass parameters correctly to slab_bug Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0f389ec6 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: No need for per node slab counters if !SLUB_DEBUG The per node counters are used mainly for showing data through the sysfs API. If that API is not compiled in then there is no point in keeping track of this data. Disable counters for the number of slabs and the number of total slabs if !SLUB_DEBUG. Incrementing the per node counters is also accessing a potentially contended cacheline so this could actually be a performance benefit to embedded systems. SLABINFO support is also affected. It must now depend on SLUB_DEBUG (which is on by default). Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab() if the system is not compiled with NUMA support. [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
49bd5221 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Move map/flag clearing to __free_slab __free_slab does some diagnostics. The resetting of mapcount etc in discard_slab() can interfere with debug processing. So move the reset immediately before the page is freed. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
50ef37b9 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Fixes to per cpu stat output in sysfs Only output per cpu stats if the kernel is built for SMP. Use a capital "C" as a leading character for the processor number (same as the numa statistics that also use a capital letter "N"). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
5b06c853 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Deal with config variable dependencies count_partial() is used by both slabinfo and the sysfs proc support. Move the function directly before the beginning of the sysfs code so that it can be easily found. Rework the preprocessor conditional to take into account that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG. Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There is no point in keeping statistics if no one can retrieve them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
4097d601 |
|
14-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Reduce #ifdef ZONE_DMA by moving kmalloc_caches_dma near dma logic Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA. This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
62f75532 |
|
14-Apr-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: Initialize per-cpu stats As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before using it. [kmem_cache_cpu structures are usually allocated from arrays defined via DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl]. Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
#
00460dd5 |
|
01-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
Fix undefined count_partial if !CONFIG_SLABINFO Small typo in the patch recently merged to avoid the unused symbol message for count_partial(). Discussion thread with confirmation of fix at http://marc.info/?t=120696854400001&r=1&w=2 Typo in the check if we need the count_partial function that was introduced by 53625b4204753b904addd40ca96d9ba802e6977d Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e72e9c23 |
|
27-Mar-2008 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "SLUB: remove useless masking of GFP_ZERO" This reverts commit 3811dbf67162bd08412f1b0e02e554f353e93bdb. The masking was not at all useless, and it was sensible. We handle GFP_ZERO in the caller, and passing it down to any page allocator logic is buggy and wrong. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
53625b42 |
|
19-Mar-2008 |
Christoph Lameter <clameter@sgi.com> |
count_partial() is not used if !SLUB_DEBUG and !CONFIG_SLABINFO Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO is defined. This patch will be reversed when slab defrag is merged since slab defrag requires count_partial() to determine the fragmentation status of slab caches. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
caeab084 |
|
13-Mar-2008 |
Christoph Lameter <clameter@sgi.com> |
slub page alloc fallback: Enable interrupts for GFP_WAIT. The fallback path needs to enable interrupts like done for the other page allocator calls. This was not necessary with the alternate fast path since we handled irq enable/disable in the slow path. The regular fastpath handles irq enable/disable around calls to the slow path so we need to restore the proper status before calling the page allocator from the slowpath. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
b6210386 |
|
05-Mar-2008 |
Nick Piggin <npiggin@suse.de> |
slub: Do not cross cacheline boundaries for very small objects SLUB should pack even small objects nicely into cachelines if that is what has been asked for. Use the same algorithm as SLAB for this. The effect of this patch for a system with a cacheline size of 64 bytes is that the 24 byte sized slab caches will now put exactly 2 objects into a cacheline instead of 3 with some overlap into the next cacheline. This reduces the object density in a 4k slab from 170 to 128 objects (same as SLAB). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com>
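The numbers in this message (170 versus 128 objects per 4k slab) can be reproduced with a sketch of a SLAB-style alignment rule: start from the cacheline size and halve it while the object still fits into half of it. The function names are hypothetical and this is not the kernel's calculate_alignment().

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next multiple of a. */
static size_t sketch_align_up(size_t v, size_t a)
{
    return (v + a - 1) / a * a;
}

/*
 * Alignment for a cacheline-aligned cache: halve the cacheline size
 * for as long as the object fits into half of the current alignment,
 * so that small objects never straddle a cacheline boundary.
 */
static size_t sketch_hwcache_align(size_t size, size_t cache_line)
{
    size_t ralign = cache_line;

    assert(size > 0 && cache_line > 0);
    while (size <= ralign / 2)
        ralign /= 2;
    return ralign;
}

/* Number of objects of the given size and alignment that fit in a slab. */
static size_t sketch_objects_per_slab(size_t slab_size, size_t size, size_t align)
{
    return slab_size / sketch_align_up(size, align);
}
```

For 24 byte objects and 64 byte cachelines the alignment becomes 32, so each cacheline holds exactly two objects and a 4096 byte slab holds 128 of them. Packing at the natural 8 byte alignment instead gives 170.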
|
#
b773ad73 |
|
04-Mar-2008 |
Christoph Lameter <clameter@sgi.com> |
slub statistics: Fix check for DEACTIVATE_REMOTE_FREES The remote frees are in the freelist of the page and not in the percpu freelist. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
62e5c4b4 |
|
02-Mar-2008 |
Cyrill Gorcunov <gorcunov@gmail.com> |
slub: fix possible NULL pointer dereference This patch fixes a possible NULL pointer dereference if kzalloc fails. To be able to return a proper error code the function return type is changed to ssize_t (according to callees and sysfs definitions). Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
f619cfe1 |
|
01-Mar-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Add kmalloc_large_node() to support kmalloc_node fallback Slub is missing some NUMA support for large kmallocs. Provide that. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
76931434 |
|
01-Mar-2008 |
Pekka J Enberg <penberg@cs.helsinki.fi> |
slub: look up object from the freelist once We only need to look up object from c->page->freelist once in __slab_alloc(). Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
6446faa2 |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Fix up comments Provide comments and fix up various spelling / style issues. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
d8b42bf5 |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Rearrange #ifdef CONFIG_SLUB_DEBUG in calculate_sizes() Group SLUB_DEBUG code together to reduce the number of #ifdefs. Move some debug checks under the #ifdef. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
ae20bfda |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Remove BUG_ON() from ksize and omit checks for !SLUB_DEBUG The BUG_ONs are useless since the pointer derefs will lead to NULL deref errors anyways. Some of the checks are not necessary if no debugging is possible. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
27d9e4e9 |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Use the objsize from the kmem_cache_cpu structure No need to access the kmem_cache structure. We have the same value in kmem_cache_cpu. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
d692ef6d |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Remove useless checks in alloc_debug_processing Alloc debug processing is never called with a NULL object pointer. No reason to check for NULL. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
e153362a |
|
16-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Remove objsize check in kmem_cache_flags() There is no page->offset anymore and also no associated limit on the number of objects. The page->offset field was removed for 2.6.24. So the check in kmem_cache_flags() is now also obsolete (should have been dropped earlier, somehow a hunk vanished). Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-by: Christoph Lameter <clameter@sgi.com>
|
#
d9acf4b7 |
|
15-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: rename slab_objects to show_slab_objects The sysfs callback is better named show_slab_objects since it is always called from the xxx_show callbacks. We need the name for other purposes later. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
a973e9dd |
|
01-Mar-2008 |
Christoph Lameter <clameter@sgi.com> |
Revert "unique end pointer" patch This only made sense for the alternate fastpath which was reverted last week. Mathieu is working on a new version that addresses the fastpath issues but that new code first needs to go through mm and it is not clear if we need the unique end pointers with his new scheme. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
00e962c5 |
|
19-Feb-2008 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Revert "SLUB: Alternate fast paths using cmpxchg_local" This reverts commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c, which is suspected to be the reason for some very occasional and hard-to-trigger crashes that usually look related to memory allocation (mostly reported in networking, but since that's generally the most common source of shortlived allocations - and allocations in interrupt contexts - that in itself is not a big clue). See for example http://bugzilla.kernel.org/show_bug.cgi?id=9973 http://lkml.org/lkml/2008/2/19/278 etc. One promising suspicion for what the root cause of the bug is (which also explains why it's so hard to trigger in practice) came from Eric Dumazet: "I wonder how SLUB_FASTPATH is supposed to work, since it is affected by a classical ABA problem of lockless algo. cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed, while an interrupt came (on this cpu), and several allocations were done, and one free was performed at the end of this interruption, so 'object' was recycled. c->freelist can then contain the previous value (object), but object[c->offset] was changed by IRQ. We then put back in freelist an already allocated object." but another reason for the revert is simply that everybody agrees that this code was the main suspect just by virtue of the pattern of oopses. Cc: Torsten Kaiser <just.for.lkml@googlemail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
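The ABA scenario described in the quote can be replayed deterministically in a single-threaded sketch. Everything here is a hypothetical model, not kernel code: the "interrupt" is simply code that runs between the fast path's read of the freelist and its compare-and-exchange, and `sketch_cmpxchg` is a plain C stand-in for cmpxchg_local.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_obj {
    struct sketch_obj *next; /* freelist link, only meaningful while free */
    bool allocated;
};

/* Plain allocation and free on a singly linked freelist. */
static struct sketch_obj *sketch_alloc(struct sketch_obj **freelist)
{
    struct sketch_obj *o = *freelist;

    if (o) {
        *freelist = o->next;
        o->allocated = true;
    }
    return o;
}

static void sketch_free(struct sketch_obj **freelist, struct sketch_obj *o)
{
    o->allocated = false;
    o->next = *freelist;
    *freelist = o;
}

/* Stand-in for cmpxchg_local(): store newval only if *loc still equals old. */
static bool sketch_cmpxchg(struct sketch_obj **loc, struct sketch_obj *old,
                           struct sketch_obj *newval)
{
    if (*loc != old)
        return false;
    *loc = newval;
    return true;
}

/* Returns true if the freelist ends up pointing at an object that is in use. */
static bool sketch_aba_demo(void)
{
    struct sketch_obj a = { NULL, false }, b = { NULL, false }, c = { NULL, false };
    struct sketch_obj *freelist = &a;
    struct sketch_obj *object, *next, *x, *y;
    bool swapped;

    a.next = &b;
    b.next = &c; /* freelist: a -> b -> c */

    /* Fast path, step 1: read the head and its successor. */
    object = freelist;   /* a */
    next = object->next; /* b */

    /* "Interrupt": two allocations and one free of the first object. */
    x = sketch_alloc(&freelist); /* takes a */
    y = sketch_alloc(&freelist); /* takes b, which stays in use */
    sketch_free(&freelist, x);   /* freelist: a -> c */
    (void)y;

    /* Fast path, step 2: the head is a again, so the exchange succeeds
     * and installs the stale successor b, which is still allocated. */
    swapped = sketch_cmpxchg(&freelist, object, next);
    return swapped && freelist == &b && freelist->allocated;
}
```

The compare only checks that the head pointer has the same value as before, not that the list behind it is unchanged. This is why the freelist ends up handing out an object that is already in use.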
|
#
331dc558 |
|
14-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Support 4k kmallocs again to compensate for page allocator slowness Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the mistaken belief that the page allocator can handle these allocations effectively. However, measurements indicate a minimum slowdown by the factor of 8 (and that is only SMP, NUMA is much worse) vs the slub fastpath which causes regressions in tbench. Increase the number of kmalloc caches by one so that we again handle 4k kmallocs directly from slub. 4k page buffering for the page allocator will be performed by slub like done by slab. At some point the page allocator fastpath should be fixed. A lot of the kernel would benefit from a faster ability to allocate a single page. If that is done then the 4k allocs may again be forwarded to the page allocator and this patch could be reverted. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
71c7a06f |
|
14-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Fallback to kmalloc_large for failing higher order allocs Slub already has two ways of allocating an object. One is via its own logic and the other is via the call to kmalloc_large to hand off object allocation to the page allocator. kmalloc_large is typically used for objects >= PAGE_SIZE. We can use that handoff to avoid failing if a higher order kmalloc slab allocation cannot be satisfied by the page allocator. If we reach the out of memory path then simply try a kmalloc_large(). kfree() can already handle the case of an object that was allocated via the page allocator and so this will work just fine (apart from object accounting...). For any kmalloc slab that already requires higher order allocs (which makes it impossible to use the page allocator fastpath!) we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of objects in one go from the page allocator slowpath. On a 4k platform this patch will lead to the following use of higher order pages for the following kmalloc slabs:

8 ... 1024      order 0
2048 .. 4096    order 3 (4k slab only after the next patch)

We may waste some space if fallback occurs on a 2k slab but we are always able to fallback to an order 0 alloc. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
b7a49f0d |
|
14-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
slub: Determine gfpflags once and not every time a slab is allocated Currently we determine the gfp flags to pass to the page allocator each time a slab is being allocated. Determine the bits to be set at the time the slab is created. Store in a new allocflags field and add the flags in allocate_slab(). Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
dada123d |
|
13-Feb-2008 |
Adrian Bunk <bunk@kernel.org> |
make slub.c:slab_address() static slab_address() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
eada35ef |
|
11-Feb-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
slub: kmalloc page allocator pass-through cleanup This adds a proper function for kmalloc page allocator pass-through. While it greatly simplifies any code that does slab tracing, I think it's a worthwhile cleanup in itself. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
3adbefee |
|
05-Feb-2008 |
Ingo Molnar <mingo@elte.hu> |
SLUB: fix checkpatch warnings fix checkpatch --file mm/slub.c errors and warnings.

$ q-code-quality-compare
                      errors   lines of code   errors/KLOC
mm/slub.c [before]        22            4204           5.2
mm/slub.c [after]          0            4210             0

no code changed:

   text    data     bss     dec     hex filename
  22195    8634     136   30965    78f5 slub.o.before
  22195    8634     136   30965    78f5 slub.o.after

md5:
   93cdfbec2d6450622163c590e1064358  slub.o.before.asm
   93cdfbec2d6450622163c590e1064358  slub.o.after.asm

[clameter: rediffed against Pekka's cleanup patch, omitted moves of the name of a function to the start of line] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
a76d3546 |
|
08-Jan-2008 |
Nick Piggin <nickpiggin@yahoo.com.au> |
Use non atomic unlock Slub can use the non-atomic version to unlock because other flags will not get modified with the lock held. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
8ff12cfc |
|
07-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Support for performance statistics The statistics provided here allow the monitoring of allocator behavior but at the cost of some (minimal) loss of performance. Counters are placed in SLUB's per cpu data structure. The per cpu structure may be extended by the statistics to grow larger than one cacheline which will increase the cache footprint of SLUB. There is a compile option to enable/disable the inclusion of the runtime statistics and it's off by default. The slabinfo tool is enhanced to support these statistics via two options:

-D  Switches the line of information displayed for a slab from size mode to activity mode.
-A  Sorts the slabs displayed by activity. This allows the display of the slabs most important to the performance of a certain load.
-r  Report option will report detailed statistics on

Example (tbench load):

slabinfo -AD    ->Shows the most active slabs

Name                   Objects      Alloc       Free   %Fast
skbuff_fclone_cache         33  111953835  111953835   99  99
:0000192                  2666    5283688    5281047   99  99
:0001024                   849    5247230    5246389   83  83
vm_area_struct            1349     119642     118355   91  22
:0004096                    15      66753      66751   98  98
:0000064                  2067      25297      23383   98  78
dentry                   10259      28635      18464   91  45
:0000080                 11004      18950       8089   98  98
:0000096                  1703      12358      10784   99  98
:0000128                   762      10582       9875   94  18
:0000512                   184       9807       9647   95  81
:0002048                   479       9669       9195   83  65
anon_vma                   777       9461       9002   99  71
kmalloc-8                 6492       9981       5624   99  97
:0000768                   258       7174       6931   58  15

So the skbuff_fclone_cache is of highest importance for the tbench load. Pretty high load on the 192 sized slab. Look for the aliases

slabinfo -a | grep 000192
:0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili

Likely skbuff_head_cache. Looking into the statistics of the skbuff_fclone_cache is possible through

slabinfo skbuff_fclone_cache    ->-r option implied if cache name is mentioned

.... Usual output ...

Slab Perf Counter          Alloc       Free  %Al  %Fr
--------------------------------------------------
Fastpath               111953360  111946981   99   99
Slowpath                    1044       7423    0    0
Page Alloc                   272        264    0    0
Add partial                   25        325    0    0
Remove partial                86        264    0    0
RemoteObj/SlabFrozen         350       4832    0    0
Total                  111954404  111954404

Flushes 49 Refill 0
Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)

Looks good because the fastpath is overwhelmingly taken.

skbuff_head_cache:

Slab Perf Counter          Alloc       Free  %Al  %Fr
--------------------------------------------------
Fastpath                 5297262    5259882   99   99
Slowpath                    4477      39586    0    0
Page Alloc                   937        824    0    0
Add partial                    0       2515    0    0
Remove partial              1691        824    0    0
RemoteObj/SlabFrozen        2621       9684    0    0
Total                    5301739    5299468

Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)

Descriptions of the output:

Total: The total number of allocation and frees that occurred for a slab
Fastpath: The number of allocations/frees that used the fastpath.
Slowpath: Other allocations
Page Alloc: Number of calls to the page allocator as a result of slowpath processing
Add Partial: Number of slabs added to the partial list through free or alloc (occurs during cpuslab flushes)
Remove Partial: Number of slabs removed from the partial list as a result of allocations retrieving a partial slab or by a free freeing the last object of a slab.
RemoteObj/Froz: How many times were remotely freed object encountered when a slab was about to be deactivated.
Frozen: How many times was free able to skip list processing because the slab was in use as the cpuslab of another processor.
Flushes: Number of times the cpuslab was flushed on request (kmem_cache_shrink, may result from races in __slab_alloc)
Refill: Number of times we were able to refill the cpuslab from remotely freed objects for the same slab.
Deactivate: Statistics how slabs were deactivated. Shows how they were put onto the partial list.

In general fastpath is very good. Slowpath without partial list processing is also desirable. Any touching of partial list uses node specific locks which may potentially cause list lock contention. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
1f84260c |
|
08-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Alternate fast paths using cmpxchg_local Provide an alternate implementation of the SLUB fast paths for alloc and free using cmpxchg_local. The cmpxchg_local fast path is selected for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an interrupt enable/disable sequence. This is known to be true for both x86 platforms so set FAST_CMPXCHG_LOCAL for both arches. Currently another requirement for the fastpath is that the kernel is compiled without preemption. The restriction will go away with the introduction of a new per cpu allocator and new per cpu operations. The advantages of a cmpxchg_local based fast path are: 1. Potentially lower cycle count (30%-60% faster) 2. There is no need to disable and enable interrupts on the fast path. Currently interrupts have to be disabled and enabled on every slab operation. This is likely avoiding a significant percentage of interrupt off / on sequences in the kernel. 3. The disposal of freed slabs can occur with interrupts enabled. The alternate path is realized using #ifdef's. Several attempts to do the same with macros and inline functions resulted in a mess (in particular due to the strange way that local_interrupt_save() handles its argument and due to the need to define macros/functions that sometimes disable interrupts and sometimes do something else). [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
683d0baa |
|
08-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Use unique end pointer for each slab page. We use a NULL pointer on freelists to signal that there are no more objects. However the NULL pointers of all slabs match in contrast to the pointers to the real objects which are in different ranges for different slab pages. Change the end pointer to be a pointer to the first object and set bit 0. Every slab will then have a different end pointer. This is necessary to ensure that end markers can be matched to the source slab during cmpxchg_local. Bring back the use of the mapping field by SLUB since we would otherwise have to call a relatively expensive function page_address() in __slab_alloc(). Use of the mapping field allows avoiding a call to page_address() in various other functions as well. There is no need to change the page_mapping() function since bit 0 is set on the mapping as also for anonymous pages. page_mapping(slab_page) will therefore still return NULL although the mapping field is overloaded. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
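The tagging scheme described here (end marker = address of the first object with bit 0 set) can be sketched as below. The helper names are hypothetical, and the sketch assumes object addresses are at least 2-byte aligned so that bit 0 is always free for the tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the per-slab end marker from the address of the slab's first object. */
static void *sketch_make_end(void *first_object)
{
    /* Bit 0 must be unused in real object addresses. */
    assert(((uintptr_t)first_object & 1) == 0);
    return (void *)((uintptr_t)first_object | 1);
}

/* A freelist entry with bit 0 set is an end marker, not an object. */
static bool sketch_is_end(const void *p)
{
    return ((uintptr_t)p & 1) != 0;
}

/* Recover the address of the slab's first object from its end marker. */
static void *sketch_end_to_first(const void *end)
{
    return (void *)((uintptr_t)end & ~(uintptr_t)1);
}
```

Two different slabs then terminate their freelists with different values, so a compare-and-exchange on the freelist head can tell an empty freelist of one slab from an empty freelist of another, which a shared NULL cannot do.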
|
#
5bb983b0 |
|
07-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Deal with annoying gcc warning on kfree() gcc 4.2 spits out an annoying warning if one casts a const void * pointer to a void * pointer. No warning is generated if the conversion is done through an assignment. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
ba84c73c |
|
08-Jan-2008 |
root <root@programming.kicks-ass.net> |
SLUB: Do not upset lockdep

inconsistent {softirq-on-W} -> {in-softirq-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
 (&n->list_lock){-+..}, at: [<ffffffff802935c1>] add_partial+0x31/0xa0
{softirq-on-W} state was registered at:
  [<ffffffff80259fb8>] __lock_acquire+0x3e8/0x1140
  [<ffffffff80259838>] debug_check_no_locks_freed+0x188/0x1a0
  [<ffffffff8025ad65>] lock_acquire+0x55/0x70
  [<ffffffff802935c1>] add_partial+0x31/0xa0
  [<ffffffff805c76de>] _spin_lock+0x1e/0x30
  [<ffffffff802935c1>] add_partial+0x31/0xa0
  [<ffffffff80296f9c>] kmem_cache_open+0x1cc/0x330
  [<ffffffff805c7984>] _spin_unlock_irq+0x24/0x30
  [<ffffffff802974f4>] create_kmalloc_cache+0x64/0xf0
  [<ffffffff80295640>] init_alloc_cpu_cpu+0x70/0x90
  [<ffffffff8080ada5>] kmem_cache_init+0x65/0x1d0
  [<ffffffff807f1b4e>] start_kernel+0x23e/0x350
  [<ffffffff807f112d>] _sinittext+0x12d/0x140
  [<ffffffffffffffff>] 0xffffffffffffffff

This change isn't really necessary for correctness, but it prevents lockdep from getting upset and then disabling itself. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
06428780 |
|
08-Jan-2008 |
Pekka Enberg <penberg@cs.helsinki.fi> |
SLUB: Fix coding style violations This fixes most of the obvious coding style violations in mm/slub.c as reported by checkpatch. Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
7c2e132c |
|
08-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
Add parameter to add_partial to avoid having two functions Add a parameter to add_partial instead of having separate functions. The parameter allows more detailed control of where the slab page is placed in the partial queues. If we put slabs back at the front then they are likely to be used immediately for allocations. If they are put at the end then we can maximize the time that the partial slabs spend without being subject to allocations. When deactivating a slab we can put the slabs that had objects freed to them remotely (we can see that because objects were put on the freelist that requires locks) at the end of the list so that the cachelines of remote processors can cool down. Slabs that had objects from the local cpu freed to them (objects exist in the lockless freelist) are put at the front of the list to be reused ASAP in order to exploit the cache hot state of the local cpu. Patch seems to slightly improve tbench speed (1-2%). Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
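The front-or-tail placement described in this message can be sketched in userspace C. This is a minimal illustration, not the mm/slub.c code: `list_node`, `partial_list`, `partial_list_init` and `add_partial_sketch` are hypothetical names, and the list is a plain circular doubly linked list with a sentinel head.

```c
#include <stddef.h>

/* Hypothetical stand-in for a list link embedded in a slab page. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical per-node partial list: head.next is the front, head.prev the tail. */
struct partial_list {
	struct list_node head;
	unsigned long nr_partial;
};

static void partial_list_init(struct partial_list *n)
{
	n->head.prev = n->head.next = &n->head;
	n->nr_partial = 0;
}

/*
 * Insert 'slab' at the tail when 'tail' is nonzero (let remote cachelines
 * cool down), otherwise at the front (reuse while cache hot).
 */
static void add_partial_sketch(struct partial_list *n, struct list_node *slab, int tail)
{
	struct list_node *prev = tail ? n->head.prev : &n->head;
	struct list_node *next = prev->next;

	slab->prev = prev;
	slab->next = next;
	prev->next = slab;
	next->prev = slab;
	n->nr_partial++;
}
```

A single function with a flag keeps the insertion logic in one place; the caller decides placement based on where the freed objects came from.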
|
#
9824601e |
|
08-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: rename defrag to remote_node_defrag_ratio The NUMA defrag works by allocating objects from partial slabs on remote nodes. Rename it to remote_node_defrag_ratio to be clear about this. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
f61396ae |
|
08-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
Move count_partial before kmem_cache_shrink Move the counting function for objects in partial slabs so that it is placed before kmem_cache_shrink. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
151c602f |
|
07-Jan-2008 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Fix sysfs refcounting If CONFIG_SYSFS is set then free the kmem_cache structure when sysfs tells us it's okay. Otherwise there is the danger (as pointed out by Al Viro) that sysfs thinks the kobject still exists after kmem_cache_destroy() removed it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>
|
#
e374d483 |
|
31-Jan-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
slub: fix shadowed variable sparse warnings Introduce 'len' at outer level: mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one mm/slub.c:3393:6: originally declared here No need to declare new node: mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one mm/slub.c:3491:6: originally declared here No need to declare new x: mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one mm/slub.c:3492:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
1eada11c |
|
17-Dec-2007 |
Greg Kroah-Hartman <gregkh@suse.de> |
Kobject: convert mm/slub.c to use kobject_init/add_ng() This converts the code to use the new kobject functions, cleaning up the logic in doing so. Cc: Christoph Lameter <clameter@sgi.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
0ff21e46 |
|
06-Nov-2007 |
Greg Kroah-Hartman <gregkh@suse.de> |
kobject: convert kernel_kset to be a kobject kernel_kset does not need to be a kset, but a much simpler kobject now that we have kobj_attributes. We also rename kernel_kset to kernel_kobj to catch all users of this symbol with a build error instead of an easy-to-ignore build warning. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
081248de |
|
01-Nov-2007 |
Greg Kroah-Hartman <gregkh@suse.de> |
kset: move /sys/slab to /sys/kernel/slab /sys/kernel is where these things should go. Also updated the documentation and tool that used this directory. Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
27c3a314 |
|
01-Nov-2007 |
Greg Kroah-Hartman <gregkh@suse.de> |
kset: convert slub to use kset_create Dynamically create the kset instead of declaring it statically. Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
3514faca |
|
16-Oct-2007 |
Greg Kroah-Hartman <gregkh@suse.de> |
kobject: remove struct kobj_type from struct kset We don't need a "default" ktype for a kset. We should set this explicitly every time for each kset. This change is needed so that we can make ksets dynamic, and cleans up one of the odd, undocumented assumptions that the kset/kobject/ktype model has. This patch is based on a lot of help from Kay Sievers. Nasty bug in the block code was found by Dave Young <hidave.darkstar@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
158a9624 |
|
02-Jan-2008 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Unify /proc/slabinfo configuration Both SLUB and SLAB really did almost exactly the same thing for /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's. This just creates a common CONFIG_SLABINFO that is enabled by both SLUB and SLAB, and shares all the setup code. Maybe SLOB will want this some day too. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
57ed3eda |
|
01-Jan-2008 |
Pekka J Enberg <penberg@cs.helsinki.fi> |
slub: provide /proc/slabinfo This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work. [ mingo@elte.hu: build fix. ] Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
76be8950 |
|
21-Dec-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Improve hackbench speed Increase the minimum number of partial slabs to keep around and put partial slabs at the end of the partial queue so that they can add more objects. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3811dbf6 |
|
17-Dec-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: remove useless masking of GFP_ZERO Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already masked out in new_slab() (See how it calls allocate_slab). No need to do it twice. This reverts the SLUB parts of 7fd272550bd43cc1d7289ef0ab2fa50de137e767. Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7fd27255 |
|
09-Dec-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Avoid double memclear() in SLOB/SLUB Both slob and slub react to __GFP_ZERO by clearing the allocation, which means that passing the GFP_ZERO bit down to the page allocator is just wasteful and pointless. Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
294a80a8 |
|
05-Dec-2007 |
Vegard Nossum <vegard.nossum@gmail.com> |
SLUB's ksize() fails for size > 2048 I can't pass memory allocated by kmalloc() to ksize() if it is allocated by SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2. The error of ksize() seems to be that it does not check if the allocation was made by SLUB or the page allocator. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
efe44183 |
|
12-Nov-2007 |
Denis Cheng <crquan@gmail.com> |
SLUB: killed the unused "end" variable Since the macro "for_each_object" was introduced, the "end" variable is no longer used. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
05aa3450 |
|
05-Nov-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Fix memory leak by not reusing cpu_slab Fix the memory leak that may occur when we attempt to reuse a cpu_slab that was allocated while we reenabled interrupts in order to be able to grow a slab cache. The per cpu freelist may contain objects and in that situation we may overwrite the per cpu freelist pointer, losing objects. This only occurs if we find that the concurrently allocated slab fits our allocation needs. If we simply always deactivate the slab then the freelist will be properly reintegrated and the memory leak will go away. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
27bb628a |
|
28-Oct-2007 |
Al Viro <viro@ftp.linux.org.uk> |
missing atomic_read_long() in slub.c nr_slabs is atomic_long_t, not atomic_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9049e23 |
|
21-Oct-2007 |
Yasunori Goto <y-goto@jp.fujitsu.com> |
memory hotplug: make kmem_cache_node for SLUB on memory online avoid panic Fix a panic caused by a NULL pointer access of kmem_cache_node in discard_slab() after memory online. When memory is brought online, kmem_cache_nodes are created in all SLUB caches for the new node whose memory is available. slab_mem_going_online_callback() is called to create the kmem_cache_node in the callback of the memory online event. If it (or another callback) fails, then slab_mem_offline_callback() is called for rollback. On memory offline, slab_mem_going_offline_callback() is called to shrink all slub caches, then slab_mem_offline_callback() is called later. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: locking fix] [akpm@linux-foundation.org: build fix] Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4ba9b9d0 |
|
17-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab API: remove useless ctor parameter and reorder parameters Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b811c202 |
|
17-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: simplify IRQ off handling Move irq handling out of new slab into __slab_alloc. That is useful for Mathieu's cmpxchg_local patchset and also allows us to remove the crude local_irq_off in early_kmem_cache_alloc(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ea3061d2 |
|
16-Oct-2007 |
Andrew Morton <akpm@linux-foundation.org> |
slub: list_locations() can use GFP_TEMPORARY It's a short-lived allocation. Cc: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
42a9fdbb |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Optimize cacheline use for zeroing We touch a cacheline in the kmem_cache structure for zeroing to get the size. However, the hot paths in slab_alloc and slab_free do not reference any other fields in kmem_cache, so we may have to just bring in the cacheline for this one access. Add a new field to kmem_cache_cpu that contains the object size. That cacheline must already be used in the hotpaths. So we save one cacheline on every slab_alloc if we zero. We need to update the kmem_cache_cpu object size if an aliasing operation changes the objsize of a non debug slab. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
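The idea of keeping a copy of the object size next to the per cpu freelist can be sketched as follows. This is a simplified userspace illustration under stated assumptions: `cpu_cache_sketch` and `alloc_zeroed_sketch` are hypothetical names, the freelist link is assumed to live at offset 0 of each free object, and none of this is the actual kmem_cache_cpu layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per cpu structure: everything the zeroing fast path needs. */
struct cpu_cache_sketch {
	void **freelist;      /* first free object; each free object stores the next pointer */
	unsigned int objsize; /* copy of the cache's object size */
};

/*
 * Pop one object from the per cpu freelist and zero it. Only the per cpu
 * structure and the object itself are touched; the size is not fetched
 * from a separate cache-wide structure.
 */
static void *alloc_zeroed_sketch(struct cpu_cache_sketch *c)
{
	void **object = c->freelist;

	if (!object)
		return NULL;            /* a slow path would refill here */

	c->freelist = *object;          /* read the link before it is zeroed */
	memset(object, 0, c->objsize);
	return object;
}
```

Because the copy is duplicated per cpu, any operation that changes the object size of a cache has to update every copy, which is the update requirement the message mentions.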
|
#
4c93c355 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Place kmem_cache_cpu structures in a NUMA aware way The kmem_cache_cpu structures introduced are currently an array placed in the kmem_cache struct. Meaning the kmem_cache_cpu structures are overwhelmingly on the wrong node for systems with a higher number of nodes. These are performance critical structures since the per node information has to be touched for every alloc and free in a slab. In order to place the kmem_cache_cpu structure optimally we put an array of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB). However, the kmem_cache_cpu structures can now be allocated in a more intelligent way. We would like to put per cpu structures for the same cpu but different slab caches in cachelines together to save space and decrease the cache footprint. However, the slab allocators themselves control only allocations per node. We set up a simple per cpu array for every processor with 100 per cpu structures which is usually enough to get them all set up right. If we run out then we fall back to kmalloc_node. This also solves the bootstrap problem since we do not have to use slab allocator functions early in boot to get memory for the small per cpu structures. Pro: - NUMA aware placement improves memory performance - All global structures in struct kmem_cache become readonly - Dense packing of per cpu structures reduces cacheline footprint in SMP and NUMA. - Potential avoidance of exclusive cacheline fetches on the free and alloc hotpath since multiple kmem_cache_cpu structures are in one cacheline. This is particularly important for the kmalloc array. Cons: - Additional reference to one read only cacheline (per cpu array of pointers to kmem_cache_cpu) in both slab_alloc() and slab_free(). 
[akinobu.mita@gmail.com: fix cpu hotplug offline/online path] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Pekka Enberg" <penberg@cs.helsinki.fi> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ee3c72a1 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Avoid touching page struct when freeing to per cpu slab Set c->node to -1 if we allocate from a debug slab, instead of checking SlabDebug, which requires access to the page struct cacheline. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b3fba8da |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Move page->offset to kmem_cache_cpu->offset We need the offset from the page struct during slab_alloc and slab_free. In both cases we also reference the cacheline of the kmem_cache_cpu structure. We can therefore move the offset field into the kmem_cache_cpu structure freeing up 16 bits in the page struct. Moving the offset allows an allocation from slab_alloc() without touching the page struct in the hot path. The only thing left in slab_free() that touches the page struct cacheline for per cpu freeing is the checking of SlabDebug(page). The next patch deals with that. Use the available 16 bits to broaden page->inuse. More than 64k objects per slab become possible and we can get rid of the checks for that limitation. No need anymore to shrink the order of slabs if we boot with 2M sized slabs (slub_min_order=9). No need anymore to switch off the offset calculation for very large slabs since the field in the kmem_cache_cpu structure is 32 bits and so the offset field can now handle slab sizes of up to 8GB. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e65d24c |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Do not use page->mapping After moving the lockless_freelist to kmem_cache_cpu we no longer need page->lockless_freelist. Restructure the use of the struct page fields in such a way that we never touch the mapping field. This in turn allows us to remove the special casing of SLUB when determining the mapping of a page (needed for corner cases of machines with virtual caches that need to flush caches of processors mapping a page). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dfb4f096 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab A remote free may access the same page struct that also contains the lockless freelist for the cpu slab. If objects have a short lifetime and are freed by a different processor then remote frees back to the slab from which we are currently allocating are frequent. The cacheline with the page struct needs to be repeatedly acquired in exclusive mode by both the allocating thread and the freeing thread. If this is frequent enough then performance will suffer because of cacheline bouncing. This patchset puts the lockless_freelist pointer in its own cacheline. In order to make that happen we introduce a per cpu structure called kmem_cache_cpu. Instead of keeping an array of pointers to page structs we now keep an array of per cpu structures that--among other things--contain the pointer to the lockless freelist. The freeing thread can then keep possession of exclusive access to the page struct cacheline while the allocating thread keeps its exclusive access to the cacheline containing the per cpu structure. This works as long as the allocating cpu is able to service its request from the lockless freelist. If the lockless freelist runs empty then the allocating thread needs to acquire exclusive access to the cacheline with the page struct to lock the slab. The allocating thread will then check if new objects were freed to the per cpu slab. If so it will keep the slab as the cpu slab and continue with the recently remote freed objects. So the allocating thread can take a series of just freed remote pages and dish them out again. Ideally allocations could be just recycling objects in the same slab this way which will lead to an ideal allocation / remote free pattern. The number of objects that can be handled in this way is limited by the capacity of one slab. Increasing slab size via slub_min_objects/ slub_max_order may increase the number of objects and therefore performance. 
If the allocating thread runs out of objects and finds that no objects were put back by the remote processor then it will retrieve a new slab (from the partial lists or from the page allocator) and start with a whole new set of objects while the remote thread may still be freeing objects to the old cpu slab. This may then repeat until the new slab is also exhausted. If remote freeing has freed objects in the earlier slab then that earlier slab will now be on the partial freelist and the allocating thread will pick that slab next for allocation. So the loop is extended. However, both threads need to take the list_lock to make the swizzling via the partial list happen. It is likely that this kind of scheme will keep the objects being passed around to a small set that can be kept in the cpu caches leading to increased performance. More code cleanups become possible: - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around. Allows reducing the number of parameters to various functions. - Can define a new node_match() function for NUMA to encapsulate locality checks. Effect on allocations: Cachelines touched before this patch: Write: page cache struct and first cacheline of object Cachelines touched after this patch: Write: kmem_cache_cpu cacheline and first cacheline of object Read: page cache struct (but see later patch that avoids touching that cacheline) The handling when the lockless alloc list runs empty gets to be a bit more complicated since another cacheline has now to be written to. But that is halfway out of the hot path. Effect on freeing: Cachelines touched before this patch: Write: page_struct and first cacheline of object Cachelines touched after this patch depending on how we free: Write(to cpu_slab): kmem_cache_cpu struct and first cacheline of object Write(to other): page struct and first cacheline of object Read(to cpu_slab): page struct to id slab etc. 
(but see later patch that avoids touching the page struct on free) Read(to other): cpu local kmem_cache_cpu struct to verify it's not the cpu slab. Summary: Pro: - Distinct cachelines so that concurrent remote frees and local allocs on a cpuslab can occur without cacheline bouncing. - Avoids potential bouncing cachelines because of neighboring per cpu pointer updates in kmem_cache's cpu_slab structure since it now grows to a cacheline (Therefore remove the comment that talks about that concern). Cons: - Freeing objects now requires the reading of one additional cacheline. That can be mitigated for some cases by the following patches but it's not possible to completely eliminate these references. - Memory usage grows slightly. The size of each per cpu object is blown up from one word (pointing to the page_struct) to one cacheline with various data. So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say NR_SLABS is 100 and a cache line size of 128 then we have just increased SLAB metadata requirements by 12.8k per cpu. (Another later patch reduces these requirements) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
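The memory estimate in the Cons list is plain multiplication, and a tiny helper makes the 12.8k per cpu figure easy to check. The function name `percpu_metadata_bytes` is hypothetical and exists only for this illustration.

```c
/*
 * Bytes of per cpu metadata when every slab cache gets one cacheline-sized
 * structure per cpu: NR_CPUS * NR_SLABS * L1_BYTES.
 */
static unsigned long percpu_metadata_bytes(unsigned long nr_cpus,
					   unsigned long nr_slabs,
					   unsigned long l1_bytes)
{
	return nr_cpus * nr_slabs * l1_bytes;
}
```

With one cpu, 100 slab caches and 128-byte cachelines this gives 12800 bytes, matching the 12.8k per cpu quoted in the message.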
|
#
e12ba74d |
|
16-Oct-2007 |
Mel Gorman <mel@csn.ul.ie> |
Group short-lived and reclaimable kernel allocations This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6cb06229 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
Categorize GFP flags The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags: GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior. GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur. GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on. These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c. The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator. Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL. The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context. [akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f64dc58c |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
Memoryless nodes: SLUB support Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY). That way SLUB only operates on nodes with regular memory. Any allocation attempt on a memoryless node or a node with just highmem will fail, whereupon SLUB will fetch memory from a nearby node (depending on how memory policies and cpuset describe fallback). Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ef8b4520 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: fail if ksize is called with a NULL parameter A NULL pointer means that the object was not allocated. One cannot determine the size of an object that has not been allocated. Currently we return 0 but we really should BUG() on attempts to determine the size of something nonexistent. krealloc() interprets NULL to mean a zero sized object. Handle that separately in krealloc(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2408c550 |
|
16-Oct-2007 |
Satyam Sharma <satyam@infradead.org> |
{slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check kfree(NULL) would normally occur only in error paths and kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the condition check in SLUB's and SLOB's kfree() to optimize for the common case. SLAB has this already. Signed-off-by: Satyam Sharma <satyam@infradead.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
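The branch hint described here can be sketched in userspace with the GCC/Clang builtin `__builtin_expect`. The `sketch_`/`SKETCH_` macros and `kfree_would_skip` are hypothetical names for this illustration; the sentinel value 16 is an assumption modelled on the kernel's ZERO_SIZE_PTR, not something this log states.

```c
#include <stddef.h>
#include <stdint.h>

/* Tell the compiler the condition is expected to be false (GCC/Clang builtin). */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Assumed sentinel for zero-sized allocations; the value is illustrative. */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* One comparison covers both NULL and the zero-size sentinel. */
#define SKETCH_ZERO_OR_NULL_PTR(p) \
	((uintptr_t)(p) <= (uintptr_t)SKETCH_ZERO_SIZE_PTR)

/* Returns 1 when a free of 'p' would return early without doing any work. */
static int kfree_would_skip(const void *p)
{
	if (sketch_unlikely(SKETCH_ZERO_OR_NULL_PTR(p)))
		return 1;	/* rare path: nothing to free */
	return 0;		/* common path: a real object */
}
```

The hint does not change behavior; it only lets the compiler lay out the common path as the fall-through case.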
|
#
aadb4bc4 |
|
16-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: direct pass through of page size or higher kmalloc requests This gets rid of all kmalloc caches larger than page size. A kmalloc request larger than PAGE_SIZE / 2 is going to be passed through to the page allocator. This works both inline where we will call __get_free_pages instead of kmem_cache_alloc and in __kmalloc. kfree is modified to check if the object is in a slab page. If not then the page is freed via the page allocator instead. Roughly similar to what SLOB does. Advantages: - Reduces memory overhead for kmalloc array - Large kmalloc operations are faster since they do not need to pass through the slab allocator to get to the page allocator. - Performance increase of 10%-20% on alloc and 50% on free for PAGE_SIZEd allocations. SLUB must call page allocator for each alloc anyways since the higher order pages which allowed avoiding the page alloc calls are not available in a reliable way anymore. So we are basically removing useless slab allocator overhead. - Large kmallocs yield page aligned objects which is what SLAB did. Bad things like using page sized kmalloc allocations to stand in for page allocator allocs can be transparently handled and are not distinguishable from page allocator uses. - Checking for too large objects can be removed since it is done by the page allocator. Drawbacks: - No accounting for large kmalloc slab allocations anymore - No debugging of large kmalloc slab allocations. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
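The pass-through decision can be sketched as a size check plus an order computation. This is a userspace illustration under assumptions: a 4 KiB page size, a threshold of half a page (the figure the later ksize() report in this log guesses at), and hypothetical names `SKETCH_PAGE_SIZE`, `goes_to_page_allocator` and `page_order_for`.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Nonzero when a request of 'size' bytes would bypass the kmalloc caches. */
static int goes_to_page_allocator(size_t size)
{
	return size > SKETCH_PAGE_SIZE / 2;
}

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int page_order_for(size_t size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

Under these assumptions a 2048-byte request still uses a slab cache, while a 2049-byte request takes a whole page of order 0.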
|
#
1cd7daa5 |
|
16-Oct-2007 |
Adrian Bunk <bunk@stusta.de> |
slub.c:early_kmem_cache_node_alloc() shouldn't be __init WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes') ... Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ba0268a8 |
|
11-Sep-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: accurately compare debug flags during slab cache merge This was posted on Aug 28 and fixes an issue that could cause trouble when slab caches >=128k are created. http://marc.info/?l=linux-mm&m=118798149918424&w=2 Currently we simply add the debug flags unconditionally when checking for a matching slab. This creates issues for sysfs processing when slabs exist that are exempt from debugging due to their huge size or because only a subset of slabs was selected for debugging. We need to only add the flags if kmem_cache_open() would also add them. Create a function to calculate the flags that would be set if the cache would be opened and use that function to determine the flags before looking for a compatible slab. [akpm@linux-foundation.org: fixlets] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5d540fb7 |
|
31-Aug-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: do not fail if we cannot register a slab with sysfs Do not BUG() if we cannot register a slab with sysfs. Just print an error. The only consequence of not registering is that the slab cache is not visible via /sys/slab. A BUG() may not be visible that early during boot and we have had multiple issues here already. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a2f92ee7 |
|
22-Aug-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: do not fail on broken memory configurations Print a big fat warning and do what is necessary to continue if a node is marked as up (meaning either node is online (upstream) or node has memory (Andrew's tree)) but allocations from the node do not succeed. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9e86943b |
|
22-Aug-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: use atomic_long_read for atomic_long variables SLUB is using atomic_read() for variables declared atomic_long_t. Switch to atomic_long_read(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1ceef402 |
|
07-Aug-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Fix dynamic dma kmalloc cache creation The dynamic dma kmalloc creation can run into trouble if a GFP_ATOMIC allocation is the first one performed for a certain size of dma kmalloc slab. - Move the adding of the slab to sysfs into a workqueue (sysfs does GFP_KERNEL allocations) - Do not call kmem_cache_destroy() (uses slub_lock) - Only acquire the slub_lock once and--if we cannot wait--do a trylock. This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC) for a range of sizes failing due to another process holding the slub_lock. However, we only need to acquire the spinlock once in order to establish each power of two DMA kmalloc cache. The possible conflict is with the slub_lock taken during slab management actions (create / remove slab cache). It is rather typical that a driver will first fill its buffers using GFP_KERNEL allocations which will wait until the slub_lock can be acquired. Drivers will also create their slab caches first outside of an atomic context before starting to use atomic kmalloc from an interrupt context. If there are any failures then they will occur early after boot or when loading multiple drivers concurrently. Drivers can already accommodate failures of GFP_ATOMIC for other reasons. Retries will then create the slab. Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
fcda3d89 |
|
30-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Remove checks for MAX_PARTIAL from kmem_cache_shrink The MAX_PARTIAL checks were supposed to be an optimization. However, slab shrinking is a manually triggered process either through running slabinfo or by the kernel calling kmem_cache_shrink. If one really wants to shrink a slab then all operations should be done regardless of the size of the partial list. This also fixes an issue that could surface if the number of partial slabs was initially above MAX_PARTIAL in kmem_cache_shrink and later drops below MAX_PARTIAL through the elimination of empty slabs on the partial list (rare). In that case a few slabs may be left off the partial list (and only be put back when they are empty). Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
2208b764 |
|
26-Jul-2007 |
Peter Zijlstra <a.p.zijlstra@chello.nl> |
slub: fix bug in slub debug support We ClearSlabDebug() before the last SlabDebug() check. Clear it later. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
02febdf7 |
|
26-Jul-2007 |
Peter Zijlstra <peterz@infradead.org> |
slub: add lock debugging check Ingo noticed that the SLUB code does not include the lock debugging free check. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
|
#
20c2df83 |
|
19-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
#
9550b105 |
|
19-Jul-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
slub: fix ksize() for zero-sized pointers The slab and slob allocators already did this right, but slub would call "get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty end results. Noted by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8ab1372f |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Fix CONFIG_SLUB_DEBUG use for CONFIG_NUMA We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that embedded systems start to use NUMA we may need this. Put an #ifdef around places where NUMA only code uses fields only valid for CONFIG_SLUB_DEBUG. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a0e1d1be |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Move sysfs operations outside of slub_lock Sysfs can do a gazillion things when called. Make sure that we do not call any sysfs functions while holding the slub_lock. Just protect the essentials: 1. The list of all slab caches 2. The kmalloc_dma array 3. The ref counters of the slabs. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
434e245d |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Do not allocate object bit array on stack The objects per slab increase with the current patches in mm since we allow up to order 3 allocs by default. More patches in mm actually allow to use 2M or higher sized slabs. For slab validation we need per object bitmaps in order to check a slab. We end up with up to 64k objects per slab resulting in a potential requirement of 8K stack space. That does not look good. Allocate the bit arrays via kmalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
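The arithmetic behind the 8K figure, and the move from stack to heap, can be sketched in user-space C. This is only an illustration under stated assumptions: the names sketch_bitmap_bytes() and sketch_alloc_object_map() are hypothetical, and calloc() stands in for the kernel's kmalloc.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of bits in one unsigned long on this platform. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap with one bit per object, rounded up to whole longs. */
static size_t sketch_bitmap_bytes(size_t nr_objects)
{
	size_t longs = (nr_objects + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	return longs * sizeof(unsigned long);
}

/*
 * Allocate a zeroed per-object bitmap on the heap instead of the stack.
 * calloc() is a user-space stand-in here. Returns NULL on failure; the
 * caller releases the map with free().
 */
static unsigned long *sketch_alloc_object_map(size_t nr_objects)
{
	assert(nr_objects > 0);
	return calloc(1, sketch_bitmap_bytes(nr_objects));
}
```

With 64k objects the helper reports 8192 bytes, which is the stack requirement the message describes.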
|
#
81cda662 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: Cleanup zeroing allocations It becomes now easy to support the zeroing allocs with generic inline functions in slab.h. Provide inline definitions to allow the continued use of kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing functions from the slab allocators and util.c. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce15fea8 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Do not use length parameter in slab_alloc() We can get to the length of the object through the kmem_cache_structure. The additional parameter does no good and causes the compiler to generate bad code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
12ad6843 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Style fix up the loop to disable small slabs Do proper spacing and we only need to do this in steps of 8. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5af328a5 |
|
17-Jul-2007 |
Adrian Bunk <bunk@stusta.de> |
mm/slub.c: make code static Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7b55f620 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Simplify dma index -> size calculation There is no need to calculate the dma slab size ourselves. We can simply lookup the size of the corresponding non dma slab. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f1b26339 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: faster more efficient slab determination for __kmalloc kmalloc_index is a long series of comparisons. The attempt to replace kmalloc_index with something more efficient like ilog2 failed due to compiler issues with constant folding on gcc 3.3 / powerpc. kmalloc_index()'s long list of comparisons works fine for constant folding since all the comparisons are optimized away. However, SLUB also uses kmalloc_index to determine the slab to use for the __kmalloc_xxx functions. This leads to a large set of comparisons in get_slab(). The patch here allows to get rid of that list of comparisons in get_slab(): 1. If the requested size is larger than 192 then we can simply use fls to determine the slab index since all larger slabs are of the power of two type. 2. If the requested size is smaller then we cannot use fls since there are non power of two caches to be considered. However, the sizes are in a manageable range. So we divide the size by 8. Then we have only 24 possibilities left and then we simply look up the kmalloc index in a table. Code size of slub.o decreases by more than 200 bytes through this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
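The two cases above can be sketched as a small user-space function. This is a sketch, not the code from the patch: sketch_fls(), sketch_size_index and sketch_slab_index() are hypothetical names, and the index convention (index 1 for the 96-byte cache, index 2 for the 192-byte cache, otherwise index n for the 2^n-byte cache) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Position of the most significant set bit, 1-based; 0 for x == 0. */
static int sketch_fls(unsigned long x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* One entry per 8-byte step, covering sizes 1..192 (24 entries). */
static const signed char sketch_size_index[24] = {
	3,		/* 1..8     ->   8 */
	4,		/* 9..16    ->  16 */
	5, 5,		/* 17..32   ->  32 */
	6, 6, 6, 6,	/* 33..64   ->  64 */
	1, 1, 1, 1,	/* 65..96   ->  96 (non power of two) */
	7, 7, 7, 7,	/* 97..128  -> 128 */
	2, 2, 2, 2,	/* 129..192 -> 192 (non power of two) */
	2, 2, 2, 2,
};

/* Map a request size to a cache index without a chain of comparisons. */
static int sketch_slab_index(size_t size)
{
	assert(size > 0);
	if (size <= 192)
		return sketch_size_index[(size - 1) / 8];
	/* All larger caches are powers of two, so fls() is enough. */
	return sketch_fls((unsigned long)size - 1);
}
```

A request for 97 bytes lands in the 128-byte cache (index 7), while 193 bytes takes the fls path and lands in the 256-byte cache (index 8).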
|
#
dfce8648 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: do proper locking during dma slab creation We modify the kmalloc_cache_dma[] array without proper locking. Do the proper locking and undo the dma cache creation if another processor has already created it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2e443fd0 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: extract dma_kmalloc_cache from get_cache. The rarely used dma functionality in get_slab() makes the function too complex. The compiler begins to spill variables from the working set onto the stack. The created function is only used in extremely rare cases so make sure that the compiler does not decide on its own to merge it back into get_slab(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0c710013 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: add some more inlines and #ifdef CONFIG_SLUB_DEBUG Add #ifdefs around data structures only needed if debugging is compiled into SLUB. Add inlines to small functions to reduce code size. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d07dbea4 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: support __GFP_ZERO in all allocators A kernel convention for many allocators is that if __GFP_ZERO is passed to an allocator then the allocated memory should be zeroed. This is currently not supported by the slab allocators. The inconsistency makes it difficult to implement in derived allocators such as in the uncached allocator and the pool allocators. In addition the support for zeroed allocations in the slab allocators does not have a consistent API. There are no zeroing allocator functions for NUMA node placement (kmalloc_node, kmem_cache_alloc_node). The zeroing allocations are only provided for default allocs (kzalloc, kmem_cache_zalloc_node). __GFP_ZERO will make zeroing universally available and does not require any additional functions. So add the necessary logic to all slab allocators to support __GFP_ZERO. The code is added to the hot path. The gfp flags are on the stack and so the cacheline is readily available for checking if we want a zeroed object. Zeroing while allocating is now a frequent operation and we seem to be gradually approaching a 1-1 parity between zeroing and not zeroing allocs. The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
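The idea of checking a zeroing flag on the allocation path can be modelled in user space. The sketch below is an illustration only: SKETCH_GFP_ZERO and sketch_alloc() are hypothetical names, the flag value is arbitrary, and malloc() stands in for the slab allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bit standing in for __GFP_ZERO. */
#define SKETCH_GFP_ZERO 0x1u

/*
 * One allocation entry point serves both zeroing and non-zeroing callers.
 * The flag test sits directly on the allocation path, so no separate
 * "zalloc" variant is needed for each allocator function.
 */
static void *sketch_alloc(size_t size, unsigned int flags)
{
	void *obj = malloc(size);

	if (obj && (flags & SKETCH_GFP_ZERO))
		memset(obj, 0, size);
	return obj;
}
```

A zeroing wrapper then reduces to a one-line call such as sketch_alloc(size, flags | SKETCH_GFP_ZERO).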
|
#
6cb8f913 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: consistent ZERO_SIZE_PTR support and NULL result semantics Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the allocators. Move ZERO_SIZE_PTR related stuff into slab.h. Make ZERO_SIZE_PTR work for all slab allocators and get rid of the WARN_ON_ONCE(size == 0) that is still remaining in SLAB. Make slub return NULL like the other allocators if a too large memory segment is requested via __kmalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ef2ad80c |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: consolidate code for krealloc in mm/util.c The size of a kmalloc object is readily available via ksize(). ksize is provided by all allocators and thus we can implement krealloc in a generic way. Implement krealloc in mm/util.c and drop slab specific implementations of krealloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
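A generic reallocation built only on allocate, free and a size query can be sketched in user space as follows. Every name here is hypothetical, the size header is just a way to give the model a working ksize(), and the sketch deliberately ignores details such as a new size of zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header recording each object's usable size. */
union sketch_hdr {
	size_t size;
	max_align_t align;	/* keeps the payload suitably aligned */
};

static void *sketch_kmalloc(size_t size)
{
	union sketch_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->size = size;
	return h + 1;		/* payload starts right after the header */
}

/* Usable size of an object; 0 for NULL. */
static size_t sketch_ksize(const void *p)
{
	return p ? ((const union sketch_hdr *)p - 1)->size : 0;
}

static void sketch_kfree(void *p)
{
	if (p)
		free((union sketch_hdr *)p - 1);
}

/* Allocator-independent realloc: needs nothing beyond alloc, free and ksize. */
static void *sketch_krealloc(void *p, size_t new_size)
{
	size_t old_size = sketch_ksize(p);
	void *q;

	if (p && new_size <= old_size)
		return p;	/* existing object is already big enough */
	q = sketch_kmalloc(new_size);
	if (!q)
		return NULL;	/* the original object is left untouched */
	if (p)
		memcpy(q, p, old_size);
	sketch_kfree(p);
	return q;
}
```

Because the function never looks inside the allocator, the same body works for any allocator that provides the three primitives.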
|
#
d45f39cb |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB Debug: fix initial object debug state of NUMA bootstrap objects The function we are calling to initialize object debug state during early NUMA bootstrap sets up an inactive object giving it the wrong redzone signature. The bootstrap nodes are active objects and should have active redzone signatures. Currently slab validation complains and reverts the object to active state. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6300ea75 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: ensure that the number of objects per slab stays low for high orders Currently SLUB has no provision to deal with too high page orders that may be specified on the kernel boot line. If an order higher than 6 (on a 4k platform) is generated then we will BUG() because slabs get more than 65535 objects. Add some logic that decreases order for slabs that have too many objects. This allows booting with slab sizes up to MAX_ORDER. For example slub_min_order=10 will boot with a default slab size of 4M and reduce slab sizes for small object sizes to lower orders if the number of objects becomes too big. Large slab sizes like that allow a concentration of objects of the same slab cache under as few as possible TLB entries and thus potentially reduces TLB pressure. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
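The order reduction can be illustrated with a short calculation. This is a sketch under stated assumptions: a 4k page size, the 65535 object limit quoted above, and the hypothetical name sketch_clamp_order(); the real patch may structure the check differently.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   ((size_t)4096)	/* assumed 4k platform */
#define SKETCH_MAX_OBJECTS ((size_t)65535)	/* limit quoted in the message */

/*
 * Lower the requested page order until a slab of that order holds no more
 * than SKETCH_MAX_OBJECTS objects of the given size.
 */
static unsigned int sketch_clamp_order(unsigned int order, size_t object_size)
{
	assert(object_size > 0);
	while (order > 0 &&
	       (SKETCH_PAGE_SIZE << order) / object_size > SKETCH_MAX_OBJECTS)
		order--;
	return order;
}
```

For 8-byte objects a requested order of 10 is reduced to 6 (256k / 8 = 32768 objects), while 4096-byte objects keep order 10 since a 4M slab holds only 1024 of them.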
|
#
68dff6a9 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB slab validation: Move tracking information alloc outside of lock We currently have to do a GFP_ATOMIC allocation because the list_lock is already taken when we first allocate memory for tracking allocation information. It would be better if we could avoid atomic allocations. Allocate a size of the tracking table that is usually sufficient (one page) before we take the list lock. We will then only do the atomic allocation if we need to resize the table to become larger than a page (mostly only needed under large NUMA because of the tracking of cpus and nodes otherwise the table stays small). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5b95a4ac |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: use list_for_each_entry for loops over all slabs Use list_for_each_entry() instead of list_for_each(). Get rid of for_all_slabs(). It had only one user. So fold it into the callback. This also gets rid of cpu_slab_flush. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
24922684 |
|
17-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: change error reporting format to follow lockdep loosely Changes the error reporting format to loosely follow lockdep. If data corruption is detected then we generate the following lines: ============================================ BUG <slab-cache>: <problem> -------------------------------------------- INFO: <more information> [possibly multiple times] <object dump> FIX <slab-cache>: <remedial action> This also adds some more intelligence to the data corruption detection. It's now capable of figuring out the start and end. Add a comment on how to configure SLUB so that a production system may continue to operate even though occasional slab corruption occurs through a misbehaving kernel component. See "Emergency operations" in Documentation/vm/slub.txt. [akpm@linux-foundation.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f0630fff |
|
16-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: support slub_debug on by default Add a new configuration variable CONFIG_SLUB_DEBUG_ON If set then the kernel will be booted by default with slab debugging switched on. Similar to CONFIG_SLAB_DEBUG. By default slab debugging is available but must be enabled by specifying "slub_debug" as a kernel parameter. Also add support to switch off slab debugging for a kernel that was built with CONFIG_SLUB_DEBUG_ON. This works by specifying slub_debug=- as a kernel parameter. Dave Jones wanted this feature. http://marc.info/?l=linux-kernel&m=118072189913045&w=2 [akpm@linux-foundation.org: clean up switch statement] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d23cf676 |
|
06-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: remove useless EXPORT_SYMBOL kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier time period where kmem_cache_open was usable outside of slub. (Fixes powerpc build error) Signed-off-by: Chrsitoph Lameter <clameter@sgi.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dbc55faa |
|
03-Jul-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Make lockdep happy by not calling add_partial with interrupts enabled during bootstrap If we move the local_irq_enable() to the end of the function then add_partial() in early_kmem_cache_node_alloc() will be called with interrupts disabled like during regular operations. This makes lockdep happy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Andre Noll <maan@systemlinux.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
84966343 |
|
23-Jun-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: fix behavior if the text output of list_locations overflows PAGE_SIZE If slabs are allocated or freed from a large set of call sites (typical for the kmalloc area) then we may create more output than fits into a single PAGE and sysfs only gives us one page. The output should be truncated. This patch fixes the checks to do the truncation properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4b356be0 |
|
16-Jun-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: minimum alignment fixes If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again) because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the size of the small slabs is all the same. No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which for some reason wants a 128 byte alignment. This patch increases the size of the smallest cache if ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the smallest caches are disabled. If we do that then the count of the active general caches that is displayed on boot is not correct anymore since we may skip elements of the kmalloc array. So count them separately. This approach was tested by Havard yesterday. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dd08c40e |
|
16-Jun-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB slab validation: Alloc while interrupts are disabled must use GFP_ATOMIC The data structure to manage the information gathered about functions allocating and freeing objects is allocated when the list_lock has already been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
272c1d21 |
|
08-Jun-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: return ZERO_SIZE_PTR for kmalloc(0) Instead of returning the smallest available object return ZERO_SIZE_PTR. A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it is not dereferenced. The dereference of ZERO_SIZE_PTR causes a distinctive fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL. This enables functions to use zero sized object. e.g. n = number of objects. objects = kmalloc(n * sizeof(object)); for (i = 0; i < n; i++) objects[i].x = y; kfree(objects); Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
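The zero-size convention can be modelled in user space. The following is a sketch with hypothetical names (SKETCH_ZERO_SIZE_PTR, sketch_kmalloc(), sketch_kfree(), sketch_fill()); the marker address 16 is an assumption chosen for illustration, and malloc()/free() stand in for the slab allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for ZERO_SIZE_PTR: a small non-NULL address that
 * must never be dereferenced.
 */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* True for both NULL and the zero-size marker. */
static int sketch_zero_or_null(const void *p)
{
	return (uintptr_t)p <= (uintptr_t)SKETCH_ZERO_SIZE_PTR;
}

static void *sketch_kmalloc(size_t size)
{
	if (size == 0)
		return SKETCH_ZERO_SIZE_PTR;	/* success, but no storage */
	return malloc(size);
}

static void sketch_kfree(void *p)
{
	if (sketch_zero_or_null(p))
		return;				/* nothing was really allocated */
	free(p);
}

/* The pattern from the message above: n may legitimately be 0. */
static int sketch_fill(size_t n, int y)
{
	int *objects = sketch_kmalloc(n * sizeof(*objects));
	size_t i;

	if (!objects)
		return -1;			/* real allocation failure */
	for (i = 0; i < n; i++)			/* body never runs when n == 0 */
		objects[i] = y;
	sketch_kfree(objects);
	return 0;
}
```

The caller needs no special case for n == 0: the marker is non-NULL, so the failure check passes, the loop does not touch it, and the free routine ignores it.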
|
#
27390bc3 |
|
01-Jun-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: fix locking for hotplug callbacks Hotplug callbacks are performed with interrupts enabled. Slub requires interrupts to be disabled for flushing caches. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8ffa6875 |
|
31-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Fix NUMA / SYSFS bootstrap issue We need this patch in ASAP. Patch fixes the mysterious hang that remained on some particular configurations with lockdep on after the first fix that moved the #ifdef CONFIG_SLUB_DEBUG to the right location. See http://marc.info/?t=117963072300001&r=1&w=2 The kmem_cache_node cache is very special because it is needed for NUMA bootstrap. Under certain conditions (like for example if lockdep is enabled and significantly increases the size of spinlock_t) the structure may become exactly the size as one of the larger caches in the kmalloc array. That early during bootstrap we cannot perform merging properly. The unique id for the kmem_cache_node cache will match one of the kmalloc array. Sysfs will complain about a duplicate directory entry. All of this occurs while the console is not yet fully operational. Thus boot may appear to be silently failing. The kmem_cache_node cache is very special. During early bootstrap the main allocation function is not operational yet and so we have to run our own small special alloc function during early boot. It is also special in that it is never freed. We really do not want any merging on that cache. Set the refcount -1 and forbid merging of slabs that have a negative refcount. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
33e9e241 |
|
23-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB Debug: fix check for super sized slabs (>512k 64bit, >256k 32bit) The check for super sized slabs where we can no longer move the free pointer behind the object for debugging purposes etc is accessing a field that is not setup yet. We must use objsize here since the size of the slab has not been determined yet. The effect of this is that a global slab shrink via "slabinfo -s" will show errors about offsets being wrong if booted with slub_debug. Potentially there are other troubles with huge slabs under slub_debug because the calculated free pointer offset is truncated. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c12b3c62 |
|
23-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB Debug: Fix object size calculation The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the #ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for DESTROY_BY_RCU and ctor. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3ec09742 |
|
16-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Simplify debug code Consolidate functionality into the #ifdef section. Extract tracing into one subroutine. Move object debug processing into the #ifdef section so that the code in __slab_alloc and __slab_free becomes minimal. Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a35afb83 |
|
16-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Remove SLAB_CTOR_CONSTRUCTOR SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Howells <dhowells@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@ucw.cz> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5577bd8a |
|
16-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Do our own flags based on PG_active and PG_error The atomicity when handling flags in SLUB is not necessary since both flags used by SLUB are not updated in a racy way. Flag updates are either done during slab creation or destruction or under slab_lock. Some of these flags do not have the non atomic variants that we need. So define our own. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4b6f0750 |
|
16-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Define functions for cpu slab handling instead of using PageActive Use inline functions to access the per cpu bit. Introduce the notion of "freezing" a slab to make things more understandable. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c59def9f |
|
16-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: Drop support for destructors There is no user of destructors left. There is no reason why we should keep checking for destructors calls in the slab allocators. The RFC for this patch was discussed at http://marc.info/?l=linux-kernel&m=117882364330705&w=2 Destructors were mainly used for list management which required them to take a spinlock. Taking a spinlock in a destructor is a bit risky since the slab allocators may run the destructors anytime they decide a slab is no longer needed. Patch drops destructor support. Any attempt to use a destructor will BUG(). Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
18007820 |
|
16-May-2007 |
Hugh Dickins <hugh@veritas.com> |
slub: don't confuse ctor and dtor kmem_cache_create() was swapping ctor and dtor in calling find_mergeable(): though it caused no bug, and probably never would, even if destructors are retained; but fix it so as not to generate anxiety ;) Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bcf889f9 |
|
10-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: remove nr_cpu_ids hack This was in SLUB in order to head off trouble while the nr_cpu_ids functionality was not merged. It's merged now so no need to still have this. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
894b8788 |
|
10-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: support concurrent local and remote frees and allocs on a slab Avoid atomic overhead in slab_alloc and slab_free SLUB needs to use the slab_lock for the per cpu slabs to synchronize with potential kfree operations. This patch avoids that need by moving all free objects onto a lockless_freelist. The regular freelist continues to exist and will be used to free objects. So while we consume the lockless_freelist the regular freelist may build up objects. If we are out of objects on the lockless_freelist then we may check the regular freelist. If it has objects then we move those over to the lockless_freelist and do this again. There is a significant savings in terms of atomic operations that have to be performed. We can even free directly to the lockless_freelist if we know that we are running on the same processor. So this speeds up short lived objects. They may be allocated and freed without taking the slab_lock. This is particularly good for netperf. In order to maximize the effect of the new faster hotpath we extract the hottest performance pieces into inlined functions. These are then inlined into kmem_cache_alloc and kmem_cache_free. So hotpath allocation and freeing no longer requires a subroutine call within SLUB. [I am not sure that it is worth doing this because it changes the easy to read structure of slub just to reduce atomic ops. However, there is someone out there with a benchmark on 4 way and 8 way processor systems that seems to show a 5% regression vs. Slab. Seems that the regression is due to increased atomic operations use vs. SLAB in SLUB. I wonder if this is applicable or discernible at all in a real workload?] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
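The interplay of the two lists can be shown with a single-threaded model. This is a sketch, not the patch's code: the struct and function names are hypothetical, the same_cpu argument stands in for the kernel's own check of the current processor, and the places where slab_lock would be taken are marked only by comments.

```c
#include <assert.h>
#include <stddef.h>

/* A free object stores the link to the next free object in its own memory. */
struct sketch_object {
	struct sketch_object *next;
};

struct sketch_slab {
	struct sketch_object *lockless_freelist; /* touched only by the owning cpu */
	struct sketch_object *freelist;          /* would be guarded by slab_lock */
};

/* Fast path pops from the lockless list; slow path refills it in one move. */
static void *sketch_slab_alloc(struct sketch_slab *s)
{
	struct sketch_object *obj;

	if (!s->lockless_freelist) {
		/* Slow path: slab_lock would be held around these two lines. */
		s->lockless_freelist = s->freelist;
		s->freelist = NULL;
	}
	obj = s->lockless_freelist;
	if (obj)
		s->lockless_freelist = obj->next;
	return obj;	/* NULL means the slab has no free objects left */
}

/* Local frees go to the lockless list, remote frees to the regular list. */
static void sketch_slab_free(struct sketch_slab *s, void *p, int same_cpu)
{
	struct sketch_object *obj = p;
	struct sketch_object **list =
		same_cpu ? &s->lockless_freelist : &s->freelist;

	/* The remote case would hold slab_lock while linking the object. */
	obj->next = *list;
	*list = obj;
}
```

An object freed and reallocated on the same processor never leaves the lockless list, which is the short lived object case the message highlights.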
|
#
4037d452 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Move remote node draining out of slab allocators

Currently the slab allocators contain callbacks into the page allocator to perform the draining of pagesets on remote nodes. This requires SLUB to have a whole subsystem in order to be compatible with SLAB. Moving node draining out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that it is done when the vm statistics are updated. At that point we are already touching all the cachelines with the pagesets of a processor.

Add an expire counter there. If we have to update per zone or global vm statistics then assume that the pageset will require subsequent draining. The expire counter will be decremented on each vm stats update pass until it reaches zero. Then we will drain one batch from the pageset. The draining will cause vm counter updates which will then cause another expiration until the pcp is empty. So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required on large NUMA systems because otherwise significant portions of system memory can become trapped in pcp queues. The number of pcps is determined by the number of processors and nodes in a system. A system with 4 processors and 2 nodes has 8 pcps, which is okay. But a system with 1024 processors and 512 nodes has 512k pcps, with a high potential for a large amount of memory being caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
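The expire counter logic can be illustrated with a small userspace model. Everything here is an assumption made for the sketch: `struct pageset`, `stats_update_pass` and `EXPIRE_PASSES` are hypothetical names, and the value 3 is chosen only to match the "every 3 seconds" figure under the assumption of one update pass per second.

```c
#include <assert.h>

#define EXPIRE_PASSES 3 /* assumed: one pass per second gives a batch every 3 s */

/* Hypothetical stand-in for a per-cpu pageset of a remote node. */
struct pageset {
    int count;  /* pages currently parked in the pageset */
    int batch;  /* pages released by one drain */
    int expire; /* passes left until the next drain, 0 means idle */
};

/*
 * One vm statistics update pass. stats_changed tells whether counters
 * had to be folded into the per zone or global statistics.
 * Returns the number of pages drained by this pass.
 */
static int stats_update_pass(struct pageset *p, int stats_changed)
{
    int drained;

    assert(p->batch > 0);
    if (stats_changed)
        p->expire = EXPIRE_PASSES; /* activity seen: arm the counter */
    if (p->expire == 0)
        return 0;                  /* idle pageset, nothing to do */
    if (--p->expire > 0)
        return 0;                  /* not expired yet */

    /* Expired: drain one batch. */
    drained = p->count < p->batch ? p->count : p->batch;
    p->count -= drained;
    if (p->count > 0)
        p->expire = EXPIRE_PASSES; /* draining touched counters: re-arm */
    return drained;
}
```

With 5 parked pages and a batch of 2, this model releases 2, 2 and 1 pages on the third, sixth and ninth idle passes and then goes quiet.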
|
#
d1187ed2 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
vmstat: use our own timer events

vmstat is currently using the cache reaper to periodically bring the statistics up to date. The cache reaper exists in SLUB only as a way to provide compatibility with SLAB. This patch removes the vmstat calls from the slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates. Refreshing vm stats is a pretty fast job so we can run this every second and stagger this by only one tick. This will lead to some overlap in large systems. For example, a system running at 250 HZ with 1024 processors will have 4 vm updates occurring at once.

However, the vm stats update only accesses per node information. It is only necessary to stagger the vm statistics updates per processor in each node. Vm counter updates occurring on distant nodes will not cause cacheline contention.

We could implement an alternate approach that runs the first processor on each node at the second and then each of the other processors on a node on a subsequent tick. That may be useful to keep a large amount of the second free of timer activity. Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
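The overlap figure can be checked with a little arithmetic. The sketch below assumes that each processor's one-second timer is offset by one tick from the previous processor's, so processor N fires on tick N modulo HZ; the function names are hypothetical and not taken from the patch.

```c
#include <assert.h>

/* Tick within the one-second period on which a given cpu's update fires. */
static unsigned int stagger_tick(unsigned int cpu, unsigned int hz)
{
    assert(hz > 0);
    return cpu % hz;
}

/* Fewest updates that share any single tick. */
static unsigned int min_overlap(unsigned int ncpus, unsigned int hz)
{
    assert(hz > 0);
    return ncpus / hz;
}

/* Most updates that share any single tick. */
static unsigned int max_overlap(unsigned int ncpus, unsigned int hz)
{
    assert(hz > 0);
    return (ncpus + hz - 1) / hz;
}
```

For 1024 processors at 250 HZ this gives 4 updates on most ticks, matching the message, and 5 on the first 24 ticks of each second because 1024 = 4 * 250 + 24.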
|
#
8bb78442 |
|
09-May-2007 |
Rafael J. Wysocki <rjw@rjwysocki.net> |
Add suspend-related notifications for CPU hotplug Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7ae439ce |
|
09-May-2007 |
Pekka J Enberg <penberg@cs.helsinki.fi> |
krealloc: fix kerneldoc comments No "blank" (or "*") line is allowed between the function name and the lines for its parameter(s). Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5e6d444e |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: rework slab order determination In some cases SLUB is needlessly creating slabs that are larger than slub_max_order. Also the layout of some of the slabs was not satisfactory. Go to an iterative approach. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
45edfa58 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: include lifetime stats and sets of cpus / nodes in tracking output

We have information about how long an object existed and about the nodes and cpus where the allocations and frees took place. Add that information to the tracking output in /sys/slab/xx/alloc_calls and /sys/slab/xx/free_calls.

This will then enable slabinfo to output nice reports like this:

christoph@qirst:~/slub$ ./slabinfo kmalloc-128

Slabcache: kmalloc-128           Aliases:  0 Order :  0

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     128  Total  :      12   Sanity Checks : On   Total:   49152
SlabObj:     200  Full   :       7   Redzoning     : On   Used :   24832
SlabSiz:    4096  Partial:       4   Poisoning     : On   Loss :   24320
Loss   :      72  CpuSlab:       1   Tracking      : On   Lalig:   13968
Align  :       8  Objects:      20   Tracing       : Off  Lpadd:    1152

kmalloc-128 has no kmem_cache operations

kmalloc-128: Kernel object allocation
-----------------------------------------------------------------------
      6 param_sysfs_setup+0x71/0x130 age=284512/284512/284512 pid=1 nodes=0-1,3
     11 percpu_populate+0x39/0x80 age=283914/284428/284512 pid=1 nodes=0
     21 __register_chrdev_region+0x31/0x170 age=282896/284347/284473 pid=1-1705 nodes=0-2
      1 sys_inotify_init+0x76/0x1c0 age=283423 pid=1004 nodes=0
     19 as_get_io_context+0x32/0xd0 age=6/247567/283988 pid=1-11782 nodes=0,2
     10 ida_pre_get+0x4a/0x80 age=277666/283773/284526 pid=0-2177 nodes=0,2
     24 kobject_kset_add_dir+0x37/0xb0 age=282727/283860/284472 pid=1-1723 nodes=0-2
      1 acpi_ds_build_internal_buffer_obj+0xd3/0x11d age=284508 pid=1 nodes=0
     24 con_insert_unipair+0xd7/0x110 age=284438/284438/284438 pid=1 nodes=0,2
      1 uart_open+0x2d2/0x4b0 age=283896 pid=1 nodes=0
     26 dma_pool_create+0x73/0x1a0 age=282762/282833/282916 pid=1705-1723 nodes=0
      1 neigh_table_init_no_netlink+0xd2/0x210 age=284461 pid=1 nodes=0
      2 neigh_parms_alloc+0x2b/0xe0 age=284410/284411/284412 pid=1 nodes=2
      2 neigh_resolve_output+0x1e1/0x280 age=276289/276291/276293 pid=0-2443 nodes=0
      1 netlink_kernel_create+0x90/0x170 age=284472 pid=1 nodes=0
      4 xt_alloc_table_info+0x39/0xf0 age=283958/283958/283959 pid=1 nodes=1
      3 fn_hash_insert+0x473/0x720 age=277653/277661/277666 pid=2177-2185 nodes=0
      1 get_mtrr_state+0x285/0x2a0 age=284526 pid=0 nodes=0
      1 cacheinfo_cpu_callback+0x26d/0x3e0 age=284458 pid=1 nodes=0
     29 kernel_param_sysfs_setup+0x25/0x90 age=284511/284511/284512 pid=1 nodes=0-1,3
      5 process_zones+0x5e/0x170 age=284546/284546/284546 pid=0 nodes=0
      1 drm_core_init+0x48/0x160 age=284421 pid=1 nodes=2

kmalloc-128: Kernel object freeing
------------------------------------------------------------------------
    163 <not-available> age=4295176847 pid=0 nodes=0-3
      1 __vunmap+0x6e/0xf0 age=282907 pid=1723 nodes=0
     28 free_as_io_context+0x12/0x90 age=9243/262197/283474 pid=42-11754 nodes=0
      1 acpi_get_object_info+0x1b7/0x1d4 age=284475 pid=1 nodes=0
      1 do_acpi_find_child+0x45/0x4e age=284475 pid=1 nodes=0

NUMA nodes           :    0    1    2    3
------------------------------------------
All slabs                 7    2    2    1
Partial slabs             2    2    0    0

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41ecc55b |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: add CONFIG_SLUB_DEBUG CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components of SLUB. Thus SLUB will be able to replace SLOB. SLUB can arrange objects in a denser way than SLOB and the code size should be minimal without debugging and sysfs support. Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG. CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB. SLUB enables debugging via a boot parameter. SLUB debug code should always be present. CONFIG_SLUB_DEBUG can be modified in the embedded config section. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02cbc874 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: move tracking definitions and check_valid_pointer() away from debug code Move the tracking definitions and the check_valid_pointer() function away from the debugging related functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
636f0d7d |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: consolidate trace code Trace in both slab_alloc and slab_free has a lot of common code. Use a single function for both. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
35e5d7ee |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: introduce DebugSlab(page) This replaces the PageError() checking. DebugSlab is clearer and allows for future changes to the page bit used. We also need it to support CONFIG_SLUB_DEBUG. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b3459709 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: move resiliency check into SYSFS section Move the resiliency check into the SYSFS section after validate_slab that is used by the resiliency check. This will avoid a forward declaration. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7656c72b |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: add macros for scanning objects in a slab Scanning of objects happens in a number of functions. Consolidate that code. Use DECLARE_BITMAP instead of open-coding the declaration for bitmaps. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
672bba3a |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: update comments Update comments throughout SLUB to reflect the new developments. Fix up various awkward sentences. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
26a7bd03 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: get rid of finish_bootstrap Its only purpose was to bring some sort of symmetry to sysfs usage when dealing with bootstrapping per cpu flushing. Since we do not time out slabs anymore we have no need to run finish_bootstrap even without sysfs. Fold it back into slab_sysfs_init and drop the initcall for the !SYSFS case. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f99a283 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: clean up krealloc We really do not need all this gaga there. ksize gives us all the information we need to figure out if the object can cope with the new size. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
abcd08a6 |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: use check_valid_pointer in kmem_ptr_validate We needlessly duplicate code. Also make check_valid_pointer inline. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
be7b3fbc |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: after object padding only needed for Redzoning If no redzoning is selected then we do not need padding before the next object. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
65c02d4c |
|
09-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: add support for dynamic cacheline size determination SLUB currently assumes that the cacheline size is static. However, i386 f.e. supports dynamic cache line size determination. Use cache_line_size() instead of L1_CACHE_BYTES in the allocator. That also explains the purpose of SLAB_HWCACHE_ALIGN. So we will need to keep that one around to allow dynamic aligning of objects depending on boot determination of the cache line size. [akpm@linux-foundation.org: need to define it before we use it] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0f9008ef |
|
07-May-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Fix up SLUB compile The newly merged SLUB allocator patches had been generated before the removal of "struct subsystem", and ended up applying fine, but wouldn't build based on the current tree as a result. Fix up that merge error - not that SLUB is likely really ready for showtime yet, but at least I can fix the trivial stuff. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cfce6604 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab allocators: remove useless __GFP_NO_GROW flag There is no user remaining and I have never seen any use of that flag. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4f104934 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slab allocators: Remove SLAB_CTOR_ATOMIC SLAB_CTOR_ATOMIC is never used, which is no surprise since I cannot imagine that one would want to do something serious in a constructor or destructor, in particular given that the slab allocators run with interrupts disabled. Actions in constructors and destructors are by their nature very limited and usually do not go beyond initializing variables and list operations. (The i386 pgd ctor and dtors do take a spinlock in constructor and destructor..... I think that is the furthest we go at this point.) There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also establishes a certain symmetry. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
50953fe9 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slab allocators: Remove SLAB_DEBUG_INITIAL flag

I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by SLAB.

I think its purpose was to have a callback after an object has been freed to verify that the state is the constructor state again? The callback is performed before each freeing of an object.

I would think that it is much easier to check the object state manually before the free. That also places the check near the code that manipulates the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was compiled with SLAB debugging on. If there were code in a constructor handling SLAB_DEBUG_INITIAL then it would have to be conditional on SLAB_DEBUG, otherwise it would just be dead code. But there is no such code in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real use of and difficult to understand, and there are easier ways to accomplish the same effect (i.e. add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be clear in fs inode caches. Remove the pointless checks (they would even be pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support. Remove the check for unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5af60839 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN This patch was recently posted to lkml and acked by Pekka. The flag SLAB_MUST_HWCACHE_ALIGN is 1. Never checked by SLAB at all. 2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB. 3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB. The only remaining use is in sparc64 and ppc64 and their use there reflects some earlier role that the slab flag once may have had. If it is specified then SLAB_HWCACHE_ALIGN is also specified. The flag is confusing, inconsistent and has no purpose. Remove it. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
70d71228 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: remove object activities out of checking functions Make sure that the check functions really only check things and do not perform activities. Extract the tracing and object seeding out of the two check functions and place them into slab_alloc and slab_free. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2086d26a |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Free slabs and sort partial slab lists in kmem_cache_shrink

At kmem_cache_shrink check if we have any empty slabs on the partial list and, if so, remove them.

Also, as an anti-fragmentation measure, sort the partial slabs so that the most fully allocated ones come first and the least allocated last.

The next allocations may fill up the nearly full slabs. Having the least allocated slabs last gives them the maximum chance that their remaining objects may be freed. Thus we can hopefully minimize the partial slabs.

I think this is the best one can do in terms of antifragmentation measures. Real defragmentation (meaning moving objects out of the slabs with the most free objects into those that are almost full) can be implemented by reverse scanning through the list produced here, but that would mean that we need to provide a callback at slab cache creation that allows the deletion or moving of an object. This will involve slab API changes, so defer for now.

Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
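The shrink step can be modelled on a plain array. This is only a sketch: `struct slab_info` and `shrink_partial` are hypothetical names, the partial list is an array rather than a linked list, and `qsort` is swapped in for whatever ordering mechanism the patch itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary of one slab on a partial list. */
struct slab_info {
    int inuse;   /* objects currently allocated from the slab */
    int objects; /* total objects the slab can hold */
};

/* Order by allocated objects, fullest first. */
static int by_inuse_desc(const void *a, const void *b)
{
    const struct slab_info *x = a;
    const struct slab_info *y = b;

    return (y->inuse > x->inuse) - (y->inuse < x->inuse);
}

/*
 * Drop empty slabs from the list and sort the remainder so that the
 * most fully allocated slabs come first. Returns the new list length.
 */
static size_t shrink_partial(struct slab_info *list, size_t n)
{
    size_t kept = 0;
    size_t i;

    assert(list != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (list[i].inuse > 0)
            list[kept++] = list[i]; /* an empty slab would be freed here */
    }
    qsort(list, kept, sizeof(*list), by_inuse_desc);
    return kept;
}
```

Allocations that take slabs from the front of the resulting list fill the nearly full slabs first, leaving the sparsest slabs the longest time to become empty.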
|
#
88a420e4 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: add ability to list alloc / free callers per slab

This patch enables listing the callers who allocated or freed objects in a cache.

For example to list the allocators for kmalloc-128 do

cat /sys/slab/kmalloc-128/alloc_calls
      7 sn_io_slot_fixup+0x40/0x700
      7 sn_io_slot_fixup+0x80/0x700
      9 sn_bus_fixup+0xe0/0x380
      6 param_sysfs_setup+0xf0/0x280
    276 percpu_populate+0xf0/0x1a0
     19 __register_chrdev_region+0x30/0x360
      8 expand_files+0x2e0/0x6e0
      1 sys_epoll_create+0x60/0x200
      1 __mounts_open+0x140/0x2c0
     65 kmem_alloc+0x110/0x280
      3 alloc_disk_node+0xe0/0x200
     33 as_get_io_context+0x90/0x280
     74 kobject_kset_add_dir+0x40/0x140
     12 pci_create_bus+0x2a0/0x5c0
      1 acpi_ev_create_gpe_block+0x120/0x9e0
     41 con_insert_unipair+0x100/0x1c0
      1 uart_open+0x1c0/0xba0
      1 dma_pool_create+0xe0/0x340
      2 neigh_table_init_no_netlink+0x260/0x4c0
      6 neigh_parms_alloc+0x30/0x200
      1 netlink_kernel_create+0x130/0x320
      5 fz_hash_alloc+0x50/0xe0
      2 sn_common_hubdev_init+0xd0/0x6e0
     28 kernel_param_sysfs_setup+0x30/0x180
     72 process_zones+0x70/0x2e0

cat /sys/slab/kmalloc-128/free_calls
    558 <not-available>
      3 sn_io_slot_fixup+0x600/0x700
     84 free_fdtable_rcu+0x120/0x260
      2 seq_release+0x40/0x60
      6 kmem_free+0x70/0xc0
     24 free_as_io_context+0x20/0x200
      1 acpi_get_object_info+0x3a0/0x3e0
      1 acpi_add_single_object+0xcf0/0x1e40
      2 con_release_unimap+0x80/0x140
      1 free+0x20/0x40

SLAB_STORE_USER must be enabled for a slab cache by either booting with "slub_debug" or enabling user tracking specifically for the slab of interest.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e95eed57 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: Add MIN_PARTIAL We leave a minimum of partial slabs on nodes when we search for partial slabs on other nodes. Define a constant for that value. Then modify slub to keep MIN_PARTIAL slabs around. This avoids bad situations where a function frees the last object in a slab (which results in the page being returned to the page allocator) only to then allocate one again (which requires getting a page back from the page allocator if the partial list was empty). Keeping a couple of slabs on the partial list reduces overhead. Empty slabs are added to the end of the partial list to ensure that partially allocated slabs are consumed first (defragmentation). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
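The decision this message describes, keep an emptied slab or hand its page back, can be sketched as follows. The value of `MIN_PARTIAL`, the `struct node_lists` counters and the function name `slab_became_empty` are assumptions made for illustration; only the constant's name comes from the message.

```c
#include <assert.h>

#define MIN_PARTIAL 2 /* value assumed for this sketch */

/* Hypothetical per-node bookkeeping; real code keeps linked lists. */
struct node_lists {
    int nr_partial;     /* slabs currently on the partial list */
    int pages_returned; /* slabs handed back to the page allocator */
};

/*
 * Called when the last object of a slab has been freed.
 * Returns 1 if the empty slab stays on the partial list (it would be
 * added at the tail so partially allocated slabs are used first),
 * 0 if its page goes back to the page allocator.
 */
static int slab_became_empty(struct node_lists *n)
{
    assert(n->nr_partial >= 0);
    if (n->nr_partial < MIN_PARTIAL) {
        n->nr_partial++;
        return 1;
    }
    n->pages_returned++;
    return 0;
}
```

A free followed by an allocation of the same object size then finds the kept slab on the partial list instead of going to the page allocator twice.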
|
#
53e15af0 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: validation of slabs (metadata and guard zones)

This enables validation of slabs. Validation means that all objects are checked to see if there are redzone violations, if padding has been overwritten or any pointers have been corrupted. It also checks the consistency of slab counters.

Validation enables the detection of metadata corruption without the kernel having to execute code that actually uses (allocs/frees) an object. It allows one to make sure that the slab metainformation and the guard values around an object have not been compromised.

A single slabcache can be checked by writing a 1 to the "validate" file, i.e.

echo 1 >/sys/slab/kmalloc-128/validate

or use the slabinfo tool to check all slabs

slabinfo -v

Error messages will show up in the syslog.

Note that validation can only reach slabs that are on a list. This means that we are usually restricted to partial slabs and active slabs unless SLAB_STORE_USER is active, which will build a full slab list and allows validation of slabs that are fully in use. Booting with "slub_debug" set will enable SLAB_STORE_USER and then full diagnostics are available.

Note that we attempt to push cpu slabs back to the lists when we start the check. If the cpu slab is reactivated before we get to it (another processor grabs it before we get to it) then it cannot be checked.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
643b1138 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: enable tracking of full slabs If slab tracking is on then build a list of full slabs so that we can verify the integrity of all slabs and are also able to build a list of alloc/free callers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
77c5e2d0 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
slub: fix object tracking Object tracking did not work the right way for several call chains. Fix this up by adding a new parameter to slab_alloc and slab_free that specifies the caller address explicitly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b49af68f |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Add virt_to_head_page and consolidate code in slab and slub Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d85f3385 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
Make page->private usable in compound pages

If we add a new flag so that we can distinguish between the first page and the tail pages then we can avoid using page->private in the first page. page->private == page for the first page, so there is no real information in there.

Freeing up page->private makes the use of compound pages more transparent. They become more usable like real pages. Right now we have to be careful f.e. if we are going beyond PAGE_SIZE allocations in the slab on i386 because we can then no longer use the private field. This is one of the issues that cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in the page struct. I can probably avoid the 16 bit ints that I have in there right now.

Also if page->private is available then a compound page may be equipped with buffer heads. This may free up the way for filesystems to support larger blocks than page size.

We add PageTail as an alias of PageReclaim. Compound pages cannot currently be reclaimed. Because of the alias one needs to check PageCompound first.

The RFC for this approach was discussed at http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
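Why PageCompound has to be tested first follows from the aliasing: the same bit means "tail page" on a compound page and "under reclaim" on any other page. A minimal sketch, with bit positions and helper names invented for the example (they are not the kernel's definitions):

```c
#include <assert.h>

/* Bit positions are made up for this sketch; the kernel's differ. */
#define PG_COMPOUND (1UL << 0)
#define PG_RECLAIM  (1UL << 1)
#define PG_TAIL     PG_RECLAIM /* alias: same bit, different meaning */

static int page_compound(unsigned long flags)
{
    return (flags & PG_COMPOUND) != 0;
}

/* The shared bit means "tail page" only when the page is compound. */
static int page_tail(unsigned long flags)
{
    return page_compound(flags) && (flags & PG_TAIL) != 0;
}

/* The shared bit means "under reclaim" only when the page is not compound. */
static int page_reclaim(unsigned long flags)
{
    return !page_compound(flags) && (flags & PG_RECLAIM) != 0;
}

/* Mark a page as the tail of a compound page. */
static unsigned long mark_tail(unsigned long flags)
{
    /* A page under reclaim must not be turned into a tail page. */
    assert(!page_reclaim(flags));
    return flags | PG_COMPOUND | PG_TAIL;
}
```

Testing the shared bit alone would report a page under reclaim as a tail page, which is the mistake the compound check guards against.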
|
#
614410d5 |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: allocate smallest object size if the user asks for 0 bytes Makes SLUB behave like SLAB in this area to avoid issues.... Throw a stack dump to alert people. At some point the behavior should be switched back. NULL is no memory as far as I can tell, and if the user asked for 0 bytes then they should get no memory. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
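The behaviour can be pictured as a size-to-cache mapping in which a zero-byte request is rounded up to the smallest cache instead of yielding NULL. This sketch assumes power-of-two general caches starting at 8 bytes; `SMALLEST_CACHE` and `general_cache_size` are hypothetical names, and the stack dump mentioned in the message is left out.

```c
#include <assert.h>
#include <stddef.h>

#define SMALLEST_CACHE 8    /* assumed size of the smallest general cache */
#define LARGEST_CACHE  4096 /* assumed upper bound for this sketch */

/*
 * Return the object size of the general cache that would serve a
 * request of the given size, or 0 if the request is too large.
 */
static size_t general_cache_size(size_t size)
{
    size_t cache = SMALLEST_CACHE;

    if (size == 0)
        return SMALLEST_CACHE; /* zero bytes: hand out the smallest object */
    if (size > LARGEST_CACHE)
        return 0;
    while (cache < size)
        cache <<= 1;
    assert(cache >= size);
    return cache;
}
```

Under these assumptions a request for 0 bytes and a request for 1 byte are served from the same cache.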
|
#
47bfdc0d |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB: change default alignments Structures may contain u64 items on 32 bit platforms that are only able to address 64 bit items on 64 bit boundaries. Change the minimum alignment of slabs to conform to those expectations. ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structures are mixed in the general slabs. ARCH_SLAB_MINALIGN is changed because currently there is no consistent specification of object alignment. We may have that in the future when the KMEM_CACHE and related macros are used to generate slabs. These pass the alignment of the structure generated by the compiler to the slab. With KMEM_CACHE etc we could align structures that do not contain 64 bit values to 32 bit boundaries potentially saving some memory. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
81819f0f |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
SLUB core This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation. A. Management of object queues A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up. B. Storage overhead of object queues SLAB Object queues exist per node, per CPU. The alien cache queue even has a queue array that contain a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues. C. SLAB meta data overhead SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page_struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this. D. SLAB has a complex cache reaper SLUB does not need a cache reaper for UP systems. On SMP systems the per CPU slab may be pushed back into partial list but that operation is simple and does not require an iteration over a list of objects. SLAB expires per CPU, shared and alien object queues during cache reaping which may cause strange hold offs. E. SLAB has complex NUMA policy layer support SLUB pushes NUMA policy handling into the page allocator. 
This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLABs application of policies to individual slab objects allocated in SLAB is certainly a performance concern due to the frequent references to memory policies which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next. F. Reduction of the size of partial slab lists SLAB has per node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocator occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation. G. Tunables SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the uses of the spin lock to check out slabs. SLUB has a global parameter (min_slab_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order the less motions of pages between per CPU and partial lists occur and the better SLUB will be scaling. G. Slab merging We often have slab caches with similar parameters. SLUB detects those on boot up and merges them into the corresponding general caches. This leads to more effective memory use. About 50% of all caches can be eliminated through slab merging. This will also decrease slab fragmentation because partial allocated slabs can be filled up again. Slab merging can be switched off by specifying slub_nomerge on boot up. Note that merging can expose heretofore unknown bugs in the kernel because corrupted objects may now be placed differently and corrupt differing neighboring objects. Enable sanity checks to find those. H. Diagnostics The current slab diagnostics are difficult to use and require a recompilation of the kernel. 
SLUB contains debugging code that is always available (but is kept out of the hot code paths). SLUB diagnostics can be enabled via the "slub_debug" option. Parameters can be specified to select a single slab cache or a group of slab caches for diagnostics. This means that the system is running with the usual performance and it is much more likely that race conditions can be reproduced.

I. Resiliency

If basic sanity checks are on then SLUB is capable of detecting common error conditions and recovering as best as possible to allow the system to continue.

J. Tracing

Tracing can be enabled via the slub_debug=T,<slabcache> option during boot. SLUB will then log all actions on that slabcache and dump the object contents on free.

K. On demand DMA cache creation

Generally DMA caches are not needed. If a kmalloc is used with __GFP_DMA then we just create the single slabcache that is needed. For systems that have no ZONE_DMA requirement the support is completely eliminated.

L. Performance increase

Some benchmarks have shown speed improvements on kernbench in the range of 5-10%. The locking overhead of SLUB is based on the underlying base allocation size. If we can reliably allocate larger order pages then it is possible to increase SLUB performance much further. The anti-fragmentation patches may enable further performance increases.

Tested on: i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge            Disable merging of slabs
slub_min_order=x        Require a minimum order for slab caches. This increases the managed chunk size and therefore reduces meta data and locking overhead.
slub_min_objects=x      Minimum objects per slab. Default is 8.
slub_max_order=x        Avoid generating slabs larger than order specified.
slub_debug              Enable all diagnostics for all caches
slub_debug=<options>    Enable selective options for all caches
slub_debug=<o>,<cache>  Enable selective options for a certain set of caches

Available Debug options

F       Double Free checking, sanity and resiliency
R       Red zoning
P       Object / padding poisoning
U       Track last free / alloc
T       Trace all allocs / frees (only use for individual slabs).

To use SLUB: Apply this patch and then select SLUB as the default slab allocator.

[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|