#
d59070d1 |
|
11-Aug-2023 |
Arnd Bergmann <arnd@arndb.de> |
radix tree: remove unused variable Recent versions of clang warn about an unused variable, though older versions saw the 'slot++' as a use and did not warn: radix-tree.c:1136:50: error: parameter 'slot' set but not used [-Werror,-Wunused-but-set-parameter] It's clearly not needed any more, so just remove it. Link: https://lkml.kernel.org/r/20230811131023.2226509-1-arnd@kernel.org Fixes: 3a08cd52c37c7 ("radix tree: Remove multiorder support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Rong Tao <rongtao@cestc.cn> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
bde1597d |
|
16-May-2023 |
Arnd Bergmann <arnd@arndb.de> |
radix-tree: move declarations to header The xarray.c file contains the only call to radix_tree_node_rcu_free(), and it comes with its own extern declaration for it. This means the function definition causes a missing-prototype warning: lib/radix-tree.c:288:6: error: no previous prototype for 'radix_tree_node_rcu_free' [-Werror=missing-prototypes] Instead, move the declaration for this function to a new header that can be included by both, and do the same for the radix_tree_node_cachep variable that has the same underlying problem but does not cause a warning with gcc. [zhangpeng.00@bytedance.com: fix building radix tree test suite] Link: https://lkml.kernel.org/r/20230521095450.21332-1-zhangpeng.00@bytedance.com Link: https://lkml.kernel.org/r/20230516194212.548910-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
fc0e7387 |
|
09-Nov-2022 |
Rong Tao <rongtao@cestc.cn> |
lib/radix-tree.c: fix uninitialized variable compilation warning We need to set an initial value for offset to eliminate compilation warning. How to reproduce warning: $ make -C tools/testing/radix-tree radix-tree.c: In function `radix_tree_tag_clear': radix-tree.c:1046:17: warning: `offset' may be used uninitialized in this function [-Wmaybe-uninitialized] 1046 | node_tag_clear(root, parent, tag, offset); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Link: https://lkml.kernel.org/r/tencent_DF74099967595DCEA93CBDC28D062026180A@qq.com Signed-off-by: Rong Tao <rongtao@cestc.cn> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
cda83bb8 |
|
25-Jun-2022 |
wuchi <wuchi.zero@gmail.com> |
lib/radix-tree: remove unused argument of insert_entries insert_entries() doesn't use the 'bool replace' argument, and the function is only used locally, remove the argument. The historical context of the unused argument is as follow: 2: commit <3a08cd52c37c79> (radix tree: Remove multiorder support) Remove the code related to macro CONFIG_RADIX_TREE_MULTIORDER to convert to the xArray. Without the macro, there is no need to retain the argument. 1: commit <175542f575723e> (radix-tree: add radix_tree_join) Add insert_entries(..., bool replace) function, depending on the macro CONFIG_RADIX_TREE_MULTIORDER definition, the implementation is different. Notice that the implementation without the macro doesn't use the argument. [Matthew Wilcox: add historical context for argument] Link: https://lkml.kernel.org/r/20220625135324.72574-1-wuchi.zero@gmail.com Signed-off-by: wuchi <wuchi.zero@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
c95c2d32 |
|
16-Apr-2021 |
Randy Dunlap <rdunlap@infradead.org> |
lib: remove "expecting prototype" kernel-doc warnings Fix various kernel-doc warnings in lib/ due to missing or erroneous function names. Add kernel-doc for some function parameters that was missing. Use kernel-doc "Return:" notation in earlycpio.c. Quietens the following warnings: lib/earlycpio.c:61: warning: expecting prototype for cpio_data find_cpio_data(). Prototype was for find_cpio_data() instead lib/lru_cache.c:640: warning: expecting prototype for lc_dump(). Prototype was for lc_seq_dump_details() instead lru_cache.c:90: warning: Function parameter or member 'cache' not described in 'lc_create' lib/parman.c:368: warning: expecting prototype for parman_item_del(). Prototype was for parman_item_remove() instead parman.c:309: warning: Excess function parameter 'prority' description in 'parman_prio_init' lib/radix-tree.c:703: warning: expecting prototype for __radix_tree_insert(). Prototype was for radix_tree_insert() instead radix-tree.c:180: warning: Excess function parameter 'addr' description in 'radix_tree_find_next_bit' radix-tree.c:180: warning: Excess function parameter 'size' description in 'radix_tree_find_next_bit' radix-tree.c:931: warning: Function parameter or member 'iter' not described in 'radix_tree_iter_replace' Link: https://lkml.kernel.org/r/20210411221756.15461-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Jiri Pirko <jiri@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e0656501 |
|
15-Oct-2020 |
Randy Dunlap <rdunlap@infradead.org> |
lib: radix-tree: delete duplicated words Drop the repeated word "be". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200823040508.26086-1-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dd841a74 |
|
14-Jun-2020 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
radix tree test suite: Fix compilation Introducing local_lock broke compilation; fix it all up. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
#
3f649ab7 |
|
03-Jun-2020 |
Kees Cook <keescook@chromium.org> |
treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
cfa6705d |
|
27-May-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
radix-tree: Use local_lock for protection The radix-tree and idr preload mechanisms use preempt_disable() to protect the complete operation between xxx_preload() and xxx_preload_end(). As the code inside the preempt disabled section acquires regular spinlocks, which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel and eventually calls into a memory allocator, this conflicts with the RT semantics. Convert it to a local_lock which allows RT kernels to substitute them with a real per CPU lock. On non RT kernels this maps to preempt_disable() as before, but provides also lockdep coverage of the critical region. No functional change. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20200527201119.1692513-3-bigeasy@linutronix.de
|
#
3a00e7c4 |
|
21-Jan-2020 |
Alex Shi <alex.shi@linux.alibaba.com> |
ida: remove abandoned macros 3 IDA_ started macros aren't used from commit f32f004cddf8 ("ida: Convert to XArray"). so better to remove them. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
#
b7e9728f |
|
01-Nov-2019 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
idr: Fix idr_alloc_u32 on 32-bit systems Attempting to allocate an entry at 0xffffffff when one is already present would succeed in allocating one at 2^32, which would confuse everything. Return -ENOSPC in this case, as expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
#
de6cc651 |
|
27-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 153 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 675 mass ave cambridge ma 02139 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 77 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.837555891@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
eff3860b |
|
06-Dec-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Don't return retry entries from lookup Commit 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored") changed the radix tree lookup so that it stops when reaching the bottom of the tree. However, the condition was added in the wrong place, making it possible to return retry entries to the caller. Reorder the tests to check for the retry entry before checking whether we're at the bottom of the tree. The retry entry should never be found in the tree root, so it's safe to defer the check until the end of the loop. Add a regression test to the test-suite to be sure this doesn't come back. Fixes: 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored") Reported-by: Greg Kurz <groug@kaod.org> Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
3a08cd52 |
|
22-Sep-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Remove multiorder support All users have now been converted to the XArray. Removing the support reduces code size and ensures new users will use the XArray instead. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
372266ba |
|
18-Aug-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree test suite: Convert tag_tagged_items to XArray The tag_tagged_items() function is supposed to test the page-writeback tagging code. Since that has been converted to the XArray, there's not much point in testing the radix tree's tagging code. This requires using the pthread mutex embedded in the xarray instead of an external lock, so remove the pthread mutexes which protect xarrays/radix trees. Also remove radix_tree_iter_tag_set() as this was the last user. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
adb9d9c4 |
|
09-Apr-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Remove radix_tree_clear_tags The page cache was the only user of this interface and it has now been converted to the XArray. Transform the test into a test of xas_init_marks(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
8cf2f984 |
|
19-May-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Remove radix_tree_maybe_preload_order This function was only used by the page cache which is now converted to the XArray. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
2956c664 |
|
19-May-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Remove split/join code radix_tree_split and radix_tree_join were never used upstream. Remove them; if they're needed in future they will be replaced by XArray equivalents. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
1cf56f9d |
|
09-Apr-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: Remove radix_tree_update_node_t The only user of this functionality was the workingset code, and it's now been converted to the XArray. Remove __radix_tree_delete_node() entirely as it was also only used by the workingset code. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
7b8d046f |
|
01-Dec-2017 |
Matthew Wilcox <willy@infradead.org> |
shmem: Convert shmem_alloc_hugepage to XArray xa_find() is a slightly easier API to use than radix_tree_gang_lookup_slot() because it contains its own RCU locking. This commit removes the last user of radix_tree_gang_lookup_slot() so remove the function too. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
74d60958 |
|
17-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
page cache: Add and replace pages using the XArray Use the XArray APIs to add and replace pages in the page cache. This removes two uses of the radix tree preload API and is significantly shorter code. It also removes the last user of __radix_tree_create() outside radix-tree.c itself, so make it static. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
f32f004c |
|
04-Jul-2018 |
Matthew Wilcox <willy@infradead.org> |
ida: Convert to XArray Use the XA_TRACK_FREE ability to track which entries have a free bit, similarly to how it uses the radix tree's IDR_FREE tag. This eliminates the per-cpu ida_bitmap preload, and fixes the memory consumption regression I introduced when making the IDR able to store any pointer. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
58d6ea30 |
|
10-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Add XArray unconditional store operations xa_store() differs from radix_tree_insert() in that it will overwrite an existing element in the array rather than returning an error. This is the behaviour which most users want, and those that want more complex behaviour generally want to use the xas family of routines anyway. For memory allocation, xa_store() will first attempt to request memory from the slab allocator; if memory is not immediately available, it will drop the xa_lock and allocate memory, keeping a pointer in the xa_state. It does not use the per-CPU cache, although those will continue to exist until all radix tree users are converted to the xarray. This patch also includes xa_erase() and __xa_erase() for a streamlined way to store NULL. Since there is no need to allocate memory in order to store a NULL in the XArray, we do not need to trouble the user with deciding what memory allocation flags to use. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
ad3d6c72 |
|
07-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Add XArray load operation The xa_load function brings with it a lot of infrastructure; xa_empty(), xa_is_err(), and large chunks of the XArray advanced API that are used to implement xa_load. As the test-suite demonstrates, it is possible to use the XArray functions on a radix tree. The radix tree functions depend on the GFP flags being stored in the root of the tree, so it's not possible to use the radix tree functions on an XArray. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
01959dfe |
|
09-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Define struct xa_node This is a direct replacement for struct radix_tree_node. A couple of struct members have changed name, so convert those. Use a #define so that radix tree users continue to work without change. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
f8d5d0cc |
|
07-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Add definition of struct xarray This is a direct replacement for struct radix_tree_root. Some of the struct members have changed name; convert those, and use a #define so that radix_tree users continue to work without change. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
02c02bf1 |
|
03-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Change definition of sibling entries Instead of storing a pointer to the slot containing the canonical entry, store the offset of the slot. Produces slightly more efficient code (~300 bytes) and simplifies the implementation. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
3159f943 |
|
03-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
xarray: Replace exceptional entries Introduce xarray value entries and tagged pointers to replace radix tree exceptional entries. This is a slight change in encoding to allow the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry). It is also a change in emphasis; exceptional entries are intimidating and different. As the comment explains, you can choose to store values or pointers in the xarray and they are both first-class citizens. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
66ee620f |
|
25-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
idr: Permit any valid kernel pointer to be stored An upcoming change to the encoding of internal entries will set the bottom two bits to 0b10. Unfortunately, m68k only aligns some data structures to 2 bytes, so the IDR will interpret them as internal entries and things will go badly wrong. Change the radix tree so that it stops either when the node indicates that it's the bottom of the tree (shift == 0) or when the entry is not an internal entry. This means we cannot insert an arbitrary kernel pointer as a multiorder entry, but the IDR does not permit multiorder entries. Annoyingly, this means the IDR can no longer take advantage of the radix tree's ability to store a single entry at offset 0 without allocating memory. A pointer which is 2-byte aligned cannot be stored directly in the root as it would be indistinguishable from a node, so we must allocate a node in order to store a 2-byte pointer at index 0. The idr_replace() function does not take a GFP flags argument, so cannot allocate memory. If a user inserts a 4-byte aligned pointer at index 0 and then replaces it with a 2-byte aligned pointer, we must be able to store it. Arbitrary pointer values are still not permitted; pointers of the form 2 + (i * 4) for values of i between 0 and 1023 are reserved for the implementation. These are not valid kernel pointers as they would point into the zero page. This change does cause a runtime memory consumption regression for the IDA. I will recover that later. Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net>
|
#
b03f8e43 |
|
18-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
ida: Remove old API Delete ida_pre_get(), ida_get_new(), ida_get_new_above() and ida_remove() from the public API. Some of these functions still exist as internal helpers, but they should not be called by consumers. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
76f070b4 |
|
18-Aug-2018 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Fix UBSAN warning get_slot_offset() can be called with a NULL 'parent' argument. In this case, the calculated value will not be used, but calculating it is undefined. Rather than fixing the caller (__radix_tree_delete) to not call get_slot_offset(), make get_slot_offset() robust against being called with a NULL parent. Signed-off-by: Matthew Wilcox <willy@infradead.org>
|
#
7a4deea1 |
|
25-May-2018 |
Matthew Wilcox <willy@infradead.org> |
idr: fix invalid ptr dereference on item delete If the radix tree underlying the IDR happens to be full and we attempt to remove an id which is larger than any id in the IDR, we will call __radix_tree_delete() with an uninitialised 'slot' pointer, at which point anything could happen. This was easiest to hit with a single entry at id 0 and attempting to remove a non-0 id, but it could have happened with 64 entries and attempting to remove an id >= 64. Roman said: The syzcaller test boils down to opening /dev/kvm, creating an eventfd, and calling a couple of KVM ioctls. None of this requires superuser. And the result is dereferencing an uninitialized pointer which is likely a crash. The specific path caught by syzbot is via KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are other user-triggerable paths, so cc:stable is probably justified. Matthew added: We have around 250 calls to idr_remove() in the kernel today. Many of them pass an ID which is embedded in the object they're removing, so they're safe. Picking a few likely candidates: drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl. drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar drivers/atm/nicstar.c could be taken down by a handcrafted packet Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com> Debugged-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f418224 |
|
18-May-2018 |
Ross Zwisler <zwisler@kernel.org> |
radix tree: fix multi-order iteration race Fix a race in the multi-order iteration code which causes the kernel to hit a GP fault. This was first seen with a production v4.15 based kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD DAX entries. The race has to do with how we tear down multi-order sibling entries when we are removing an item from the tree. Remember for example that an order 2 entry looks like this: struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] where 'entry' is in some slot in the struct radix_tree_node, and the three slots following 'entry' contain sibling pointers which point back to 'entry.' When we delete 'entry' from the tree, we call : radix_tree_delete() radix_tree_delete_item() __radix_tree_delete() replace_slot() replace_slot() first removes the siblings in order from the first to the last, then at then replaces 'entry' with NULL. This means that for a brief period of time we end up with one or more of the siblings removed, so: struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] This causes an issue if you have a reader iterating over the slots in the tree via radix_tree_for_each_slot() while only under rcu_read_lock()/rcu_read_unlock() protection. This is a common case in mm/filemap.c. The issue is that when __radix_tree_next_slot() => skip_siblings() tries to skip over the sibling entries in the slots, it currently does so with an exact match on the slot directly preceding our current slot. Normally this works: V preceding slot struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] ^ current slot This lets you find the first sibling, and you skip them all in order. But in the case where one of the siblings is NULL, that slot is skipped and then our sibling detection is interrupted: V preceding slot struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] ^ current slot This means that the sibling pointers aren't recognized since they point all the way back to 'entry', so we think that they are normal internal radix tree pointers. This causes us to think we need to walk down to a struct radix_tree_node starting at the address of 'entry'. In a real running kernel this will crash the thread with a GP fault when you try and dereference the slots in your broken node starting at 'entry'. We fix this race by fixing the way that skip_siblings() detects sibling nodes. Instead of testing against the preceding slot we instead look for siblings via is_sibling_entry() which compares against the position of the struct radix_tree_node.slots[] array. This ensures that sibling entries are properly identified, even if they are no longer contiguous with the 'entry' they point to. Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: CR, Sapthagirish <sapthagirish.cr@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fa290cda |
|
10-Apr-2018 |
Matthew Wilcox <willy@infradead.org> |
radix tree: use GFP_ZONEMASK bits of gfp_t for flags Patch series "XArray", v9. (First part thereof). This patchset is, I believe, appropriate for merging for 4.17. It contains the XArray implementation, to eventually replace the radix tree, and converts the page cache to use it. This conversion keeps the radix tree and XArray data structures in sync at all times. That allows us to convert the page cache one function at a time and should allow for easier bisection. Other than renaming some elements of the structures, the data structures are fundamentally unchanged; a radix tree walk and an XArray walk will touch the same number of cachelines. I have changes planned to the XArray data structure, but those will happen in future patches. Improvements the XArray has over the radix tree: - The radix tree provides operations like other trees do; 'insert' and 'delete'. But what most users really want is an automatically resizing array, and so it makes more sense to give users an API that is like an array -- 'load' and 'store'. We still have an 'insert' operation for users that really want that semantic. - The XArray considers locking as part of its API. This simplifies a lot of users who formerly had to manage their own locking just for the radix tree. It also improves code generation as we can now tell RCU that we're holding a lock and it doesn't need to generate as much fencing code. The other advantage is that tree nodes can be moved (not yet implemented). - GFP flags are now parameters to calls which may need to allocate memory. The radix tree forced users to decide what the allocation flags would be at creation time. It's much clearer to specify them at allocation time. - Memory is not preloaded; we don't tie up dozens of pages on the off chance that the slab allocator fails. Instead, we drop the lock, allocate a new node and retry the operation. We have to convert all the radix tree, IDA and IDR preload users before we can realise this benefit, but I have not yet found a user which cannot be converted. - The XArray provides a cmpxchg operation. The radix tree forces users to roll their own (and at least four have). - Iterators take a 'max' parameter. That simplifies many users and will reduce the amount of iteration done. - Iteration can proceed backwards. We only have one user for this, but since it's called as part of the pagefault readahead algorithm, that seemed worth mentioning. - RCU-protected pointers are not exposed as part of the API. There are some fun bugs where the page cache forgets to use rcu_dereference() in the current codebase. - Value entries gain an extra bit compared to radix tree exceptional entries. That gives us the extra bit we need to put huge page swap entries in the page cache. - Some iterators now take a 'filter' argument instead of having separate iterators for tagged/untagged iterations. The page cache is improved by this: - Shorter, easier to read code - More efficient iterations - Reduction in size of struct address_space - Fewer walks from the top of the data structure; the XArray API encourages staying at the leaf node and conducting operations there. This patch (of 8): None of these bits may be used for slab allocations, so we can use them as radix tree flags as long as we mask them off before passing them to the slab allocator. Move the IDR flag from the high bits to the GFP_ZONEMASK bits. Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b1a8a7a7 |
|
21-Feb-2018 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
ida: do zeroing in ida_pre_get() As far as I can tell, the only place the per-cpu ida_bitmap is populated is in ida_pre_get. The pre-allocated element is stolen in two places in ida_get_new_above, in both cases immediately followed by a memset(0). Since ida_get_new_above is called with locks held, do the zeroing in ida_pre_get, or rather let kmalloc() do it. Also, apparently gcc generates ~44 bytes of code to do a memset(, 0, 128): $ scripts/bloat-o-meter vmlinux.{0,1} add/remove: 0/0 grow/shrink: 2/1 up/down: 5/-88 (-83) Function old new delta ida_pre_get 115 119 +4 vermagic 27 28 +1 ida_get_new_above 715 627 -88 Link: http://lkml.kernel.org/r/20180108225634.15340-1-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
460488c5 |
|
28-Nov-2017 |
Matthew Wilcox <willy@infradead.org> |
idr: Remove idr_alloc_ext It has no more users, so remove it. Move idr_alloc() back into idr.c, move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the wrappers around idr_get_free_cmn() and rename it to idr_get_free(). While there is now no interface to allocate IDs larger than a u32, the IDR internals remain ready to handle a larger ID should a need arise. These changes make it possible to provide the guarantee that, if the nextid pointer points into the object, the object's ID will be initialised before a concurrent lookup can find the object. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
c7df8ad2 |
|
15-Nov-2017 |
Mel Gorman <mgorman@techsingularity.net> |
mm, truncate: do not check mapping for every page being truncated During truncation, the mapping has already been checked for shmem and dax so it's known that workingset_update_node is required. This patch avoids the checks on mapping for each page being truncated. In all other cases, a lookup helper is used to determine if workingset_update_node() needs to be called. The one danger is that the API is slightly harder to use as calling workingset_update_node directly without checking for dax or shmem mappings could lead to surprises. However, the API rarely needs to be used and hopefully the comment is enough to give people the hint. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 oneirq-v1r1 pickhelper-v1r1 Min Time 141.00 ( 0.00%) 140.00 ( 0.71%) 1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%) Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%) Max Time 230.00 ( 0.00%) 205.00 ( 10.87%) Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%) Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%) Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%) Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%) Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%) Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%) Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%) Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%) Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%) As you'd expect, the gain is marginal but it can be detected. The differences in bonnie are all within the noise which is not surprising given the impact on the microbenchmark. radix_tree_update_node_t is a callback for some radix operations that optionally passes in a private field. The only user of the callback is workingset_update_node and as it no longer requires a mapping, the private field is removed. Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bc9ae224 |
|
08-Sep-2017 |
Eric Dumazet <edumazet@google.com> |
radix-tree: must check __radix_tree_preload() return value __radix_tree_preload() only disables preemption if no error is returned. So we really need to make sure callers always check the return value. idr_preload() contract is to always disable preemption, so we need to add a missing preempt_disable() if an error happened. Similarly, ida_pre_get() only needs to call preempt_enable() in the case no error happened. Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
388f79fd |
|
30-Aug-2017 |
Chris Mi <chrism@mellanox.com> |
idr: Add new APIs to support unsigned long The following new APIs are added: int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index, unsigned long start, unsigned long end, gfp_t gfp); void *idr_remove_ext(struct idr *idr, unsigned long id); void *idr_find_ext(const struct idr *idr, unsigned long id); void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id); void *idr_get_next_ext(struct idr *idr, unsigned long *nextid); Signed-off-by: Chris Mi <chrism@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d1b48c1e |
|
16-Aug-2017 |
Chris Wilson <chris@chris-wilson.co.uk> |
drm/i915: Replace execbuf vma ht with an idr This was the competing idea long ago, but it was only with the rewrite of the idr as an radixtree and using the radixtree directly ourselves, along with the realisation that we can store the vma directly in the radixtree and only need a list for the reverse mapping, that made the patch performant enough to displace using a hashtable. Though the vma ht is fast and doesn't require any extra allocation (as we can embed the node inside the vma), it does require a thread for resizing and serialization and will have the occasional slow lookup. That is hairy enough to investigate alternatives and favour them if equivalent in peak performance. One advantage of allocating an indirection entry is that we can support a single shared bo between many clients, something that was done on a first-come first-serve basis for shared GGTT vma previously. To offset the extra allocations, we create yet another kmem_cache for them. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
|
#
7e784422 |
|
03-May-2017 |
Michal Hocko <mhocko@suse.com> |
lockdep: allow to disable reclaim lockup detection The current implementation of the reclaim lockup detection can lead to false positives and those even happen and usually lead to tweak the code to silence the lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. ================================= [ INFO: inconsistent lock state ] 4.5.0-rc2+ #4 Tainted: G O --------------------------------- inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs] {RECLAIM_FS-ON-R} state was registered at: mark_held_locks+0x79/0xa0 lockdep_trace_alloc+0xb3/0x100 kmem_cache_alloc+0x33/0x230 kmem_zone_alloc+0x81/0x120 [xfs] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] __xfs_refcount_find_shared+0x75/0x580 [xfs] xfs_refcount_find_shared+0x84/0xb0 [xfs] xfs_getbmap+0x608/0x8c0 [xfs] xfs_vn_fiemap+0xab/0xc0 [xfs] do_vfs_ioctl+0x498/0x670 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x12/0x6f CPU0 ---- lock(&xfs_nondir_ilock_class); <Interrupt> lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/543: stack backtrace: CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 Call Trace: lock_acquire+0xd8/0x1e0 down_write_nested+0x5e/0xc0 xfs_ilock+0x177/0x200 [xfs] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] evict+0xc5/0x190 dispose_list+0x39/0x60 prune_icache_sb+0x4b/0x60 super_cache_scan+0x14f/0x1a0 shrink_slab.part.63.constprop.79+0x1e9/0x4e0 shrink_zone+0x15e/0x170 kswapd+0x4f1/0xa80 kthread+0xf2/0x110 ret_from_fork+0x3f/0x70 To quote Dave: "Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode. So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock..." This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special usecase IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add "I know what I am doing flag" and mark problematic places. This would prevent from abusing GFP_NOFS flag which has a runtime effect even on configurations which have lockdep disabled. Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request. While we are at it also make sure that the radix tree doesn't accidentaly override tags stored in the upper part of the gfp_mask. Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4ecd9542 |
|
02-Mar-2017 |
Matthew Wilcox <willy@infradead.org> |
ida: Free correct IDA bitmap There's a relatively rare race where we look at the per-cpu preallocated IDA bitmap, see it's NULL, allocate a new one, and atomically update it. If the kmalloc() happened to sleep and we were rescheduled to a different CPU, or an interrupt came in at the exact right time, another task might have successfully allocated a bitmap and already deposited it. I forgot what the semantics of cmpxchg() were and ended up freeing the wrong bitmap leading to KASAN reporting a use-after-free. Dmitry found the bug with syzkaller & wrote the patch. I wrote the test case that will reproduce the bug without his patch being applied. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
d7b62727 |
|
13-Feb-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Fix __rcu annotations Many places were missing __rcu annotations. A few places needed a few lines of explanation about why it was safe to not use RCU accessors. Add a custom CFLAGS setting to the Makefile to ensure that new patches don't miss RCU annotations. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
12320d0f |
|
13-Feb-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Add rcu_dereference and rcu_assign_pointer calls Some of these have been missing for many years. Others were recently introduced by me. Fortunately, we have tools that help us find such things. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
f7137f79 |
|
30-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix_tree_iter_resume: Fix out of bounds error The address sanitizer occasionally finds an out of bounds error while running the test-suite. It turned out to be a read of the pointer immediately next to the tree root, but this out of bounds error could have occurred elsewhere. This happens because radix_tree_iter_resume() dereferences 'slot' before checking whether we've come to the end of the chunk. We can just delete this line; the value was never used. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
d58275bc |
|
16-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Store a pointer to the root in each node Instead of having this mysterious private_data in each radix_tree_node, store a pointer to the root, which can be useful for debugging. This also relieves the mm code from the duty of updating it. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
1293d5c5 |
|
16-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Chain preallocated nodes through ->parent Chaining through the ->private_data member means we have to zero ->private_data after removing preallocated nodes from the list. We're about to initialise ->parent anyway, so we can avoid zeroing it. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
d37cacc5 |
|
17-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
ida: Use exceptional entries for small IDAs We can use the root entry as a bitmap and save allocating a 128 byte bitmap for an IDA that contains only a few entries (30 on a 32-bit machine, 62 on a 64-bit machine). This costs about 300 bytes of kernel text on x86-64, so as long as 3 IDAs fall into this category, this is a net win for memory consumption. Thanks to Rasmus Villemoes for his work documenting the problem and collecting statistics on IDAs. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
7ad3d4d8 |
|
16-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
ida: Move ida_bitmap to a percpu variable When we preload the IDA, we allocate an IDA bitmap. Instead of storing that preallocated bitmap in the IDA, we store it in a percpu variable. Generally there are more IDAs in the system than CPUs, so this cuts down on the number of preallocated bitmaps that are unused, and about half of the IDA users did not call ida_destroy() so they were leaking IDA bitmaps. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
0a835c4f |
|
20-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
Reimplement IDR and IDA using the radix tree The IDR is very similar to the radix tree. It has some functionality that the radix tree did not have (alloc next free, cyclic allocation, a callback-based for_each, destroy tree), which is readily implementable on top of the radix tree. A few small changes were needed in order to use a tag to represent nodes with free space below them. More extensive changes were needed to support storing NULL as a valid entry in an IDR. Plain radix trees still interpret NULL as a not-present entry. The IDA is reimplemented as a client of the newly enhanced radix tree. As in the current implementation, it uses a bitmap at the last level of the tree. Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
0ac398ef |
|
28-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Add radix_tree_iter_delete Factor the deletion code out into __radix_tree_delete() and provide a nice iterator-based wrapper around it. If we free the node, advance the iterator to avoid reading from freed memory. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
30b888ba |
|
28-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: Add radix_tree_iter_tag_clear() The counterpart to radix_tree_iter_tag_set(), used by the IDR code Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Rehas Sachdeva <aquannie@gmail.com>
|
#
10257d71 |
|
11-Jan-2017 |
Song Liu <songliubraving@fb.com> |
EXPORT_SYMBOL radix_tree_replace_slot It will be used in drivers/md/raid5-cache.c Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
35534c86 |
|
19-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix tree: constify some pointers If we're just getting the value of a tag, or looking up an entry, we won't modify the radix tree, so we can declare these functions as taking a const pointer. Mostly for documentation purposes, though it might help code generation. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
|
#
dd040b6f |
|
24-Jan-2017 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix private list warnings The newly introduced warning in radix_tree_free_nodes() was testing the wrong variable; it should have been 'old' instead of 'node'. Fixes: ea07b862ac8e ("mm: workingset: fix use-after-free in shadow node shrinker") Link: http://lkml.kernel.org/r/20170118163746.GA32495@cmpxchg.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ea07b862 |
|
06-Jan-2017 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: workingset: fix use-after-free in shadow node shrinker Several people report seeing warnings about inconsistent radix tree nodes followed by crashes in the workingset code, which all looked like use-after-free access from the shadow node shrinker. Dave Jones managed to reproduce the issue with a debug patch applied, which confirmed that the radix tree shrinking indeed frees shadow nodes while they are still linked to the shadow LRU: WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200 CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3 Call Trace: delete_node+0x1e4/0x200 __radix_tree_delete_node+0xd/0x10 shadow_lru_isolate+0xe6/0x220 __list_lru_walk_one.isra.4+0x9b/0x190 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x2e/0x40 shrink_slab.part.44+0x23d/0x5d0 shrink_node+0x22c/0x330 kswapd+0x392/0x8f0 This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the inlined radix_tree_shrink(). The problem is with 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking"), which passes an update callback into the radix tree to link and unlink shadow leaf nodes when tree entries change, but forgot to pass the callback when reclaiming a shadow node. While the reclaimed shadow node itself is unlinked by the shrinker, its deletion from the tree can cause the left-most leaf node in the tree to be shrunk. If that happens to be a shadow node as well, we don't unlink it from the LRU as we should. Consider this tree, where the s are shadow entries: root->rnode | [0 n] | | [s ] [sssss] Now the shadow node shrinker reclaims the rightmost leaf node through the shadow node LRU: root->rnode | [0 ] | [s ] Because the parent of the deleted node is the first level below the root and has only one child in the left-most slot, the intermediate level is shrunk and the node containing the single shadow is put in its place: root->rnode | [s ] The shrinker again sees a single left-most slot in a first level node and thus decides to store the shadow in root->rnode directly and free the node - which is a leaf node on the shadow node LRU. root->rnode | s Without the update callback, the freed node remains on the shadow LRU, where it causes later shrinker runs to crash. Pass the node updater callback into __radix_tree_delete_node() in case the deletion causes the left-most branch in the tree to collapse too. Also add warnings when linked nodes are freed right away, rather than wait for the use-after-free when the list is scanned much later. Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking") Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Leech <cleech@redhat.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9a0deb9 |
|
07-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
redo: radix tree test suite: fix compilation [ This resurrects commit 53855d10f456, which was reverted in 2b41226b39b6. It depended on commit d544abd5ff7d ("lib/radix-tree: Convert to hotplug state machine") so now it is correct to apply ] Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test suite as it adds a call to cpuhp_setup_state_nocalls() which is not currently emulated in the test suite. Add it, and delete the emulation of the old CPU hotplug mechanism. Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e8de4340 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: ensure counts are initialised radix_tree_join() was freeing nodes with a non-zero ->exceptional count, and radix_tree_split() wasn't zeroing ->exceptional when it allocated the new node. Fix this by making all callers of radix_tree_node_alloc() pass in the new counts (and some other always-initialised fields), which will prevent the problem recurring if in future we decide to do something similar. Link: http://lkml.kernel.org/r/1481667692-14500-3-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a90eb3a2 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix replacement for multiorder entries When replacing an entry with NULL, we need to delete any sibling entries. Also account deleting exceptional entries properly. Also fix a bug with radix_tree_iter_replace() where we would fail to remove entirely freed nodes. Also fix accounting bug when switching between normal and exceptional entries with replace_slot. Also add testcases for all these bugs. Link: http://lkml.kernel.org/r/1480369871-5271-61-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2791653a |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: add radix_tree_split_preload() Calculate how many nodes we need to allocate to split an old_order entry into multiple entries, each of size new_order. The test suite checks that we allocated exactly the right number of nodes; neither too many (checked by rtp->nr == 0), nor too few (checked by comparing nr_allocated before and after the call to radix_tree_split()). Link: http://lkml.kernel.org/r/1480369871-5271-60-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e157b555 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: add radix_tree_split This new function splits a larger multiorder entry into smaller entries (potentially multi-order entries). These entries are initialised to RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't confused. The caller should then call radix_tree_for_each_slot() and radix_tree_replace_slot() in order to turn these retry entries into the intended new entries. Tags are replicated from the original multiorder entry into each new entry. Link: http://lkml.kernel.org/r/1480369871-5271-59-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
175542f5 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: add radix_tree_join This new function allows for the replacement of many smaller entries in the radix tree with one larger multiorder entry. From the point of view of an RCU walker, they may see a mixture of the smaller entries and the large entry during the same walk, but they will never see NULL for an index which was populated before the join. Link: http://lkml.kernel.org/r/1480369871-5271-58-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
268f42de |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: delete radix_tree_range_tag_if_tagged() This is an exceptionally complicated function with just one caller (tag_pages_for_writeback). We devote a large portion of the runtime of the test suite to testing this one function which has one caller. By introducing the new function radix_tree_iter_tag_set(), we can eliminate all of the complexity while keeping the performance. The caller can now use a fairly standard radix_tree_for_each() loop, and it doesn't need to worry about tricksy things like 'start' wrapping. The test suite continues to spend a large amount of time investigating this function, but now it's testing the underlying primitives such as radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator which are also used by other parts of the kernel. Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
478922e2 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: delete radix_tree_locate_item() This rather complicated function can be better implemented as an iterator. It has only one caller, so move the functionality to the only place that needs it. Update the test suite to follow the same pattern. Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
148deab2 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: improve multiorder iterators This fixes several interlinked problems with the iterators in the presence of multiorder entries. 1. radix_tree_iter_next() would only advance by one slot, which would result in the iterators returning the same entry more than once if there were sibling entries. 2. radix_tree_next_slot() could return an internal pointer instead of a user pointer if a tagged multiorder entry was immediately followed by an entry of lower order. 3. radix_tree_next_slot() expanded to a lot more code than it used to when multiorder support was compiled in. And I wasn't comfortable with entry_to_node() being in a header file. Fixing radix_tree_iter_next() for the presence of sibling entries necessarily involves examining the contents of the radix tree, so we now need to pass 'slot' to radix_tree_iter_next(), and we need to change the calling convention so it is called *before* dropping the lock which protects the tree. Also rename it to radix_tree_iter_resume(), as some people thought it was necessary to call radix_tree_iter_next() each time around the loop. radix_tree_next_slot() becomes closer to how it looked before multiorder support was introduced. It only checks to see if the next entry in the chunk is a sibling entry or a pointer to a node; this should be rare enough that handling this case out of line is not a performance impact (and such impact is amortised by the fact that the entry we just processed was a multiorder entry). Also, radix_tree_next_slot() used to force a new chunk lookup for untagged entries, which is more expensive than the out of line sibling entry skipping. Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
218ed750 |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: improve dump output Print the indices of the entries as unsigned (instead of signed) integers and print the parent node of each entry to help navigate around larger trees where the layout is not quite so obvious. Print the indices covered by a node. Rearrange the order of fields printed so the indices and parents line up for each type of entry. Link: http://lkml.kernel.org/r/1480369871-5271-53-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bc412fca |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: make radix_tree_find_next_bit more useful Since this function is specialised to the radix tree, pass in the node and tag to calculate the address of the bitmap in radix_tree_find_next_bit() instead of the caller. Likewise, there is no need to pass in the size of the bitmap. Link: http://lkml.kernel.org/r/1480369871-5271-52-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9498d2bb |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: create node_tag_set() Similar to node_tag_clear(), factor node_tag_set() out of radix_tree_range_tag_if_tagged(). Link: http://lkml.kernel.org/r/1480369871-5271-51-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
91d9c05a |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: move rcu_head into a union with private_list I want to be able to reference node->parent after freeing node. Currently node->parent is in a union with rcu_head, so it is overwritten when the node is put on the RCU list. We know that private_list is not referenced after the node is freed, so it is safe for these two members to share space. Link: http://lkml.kernel.org/r/1480369871-5271-50-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
91b9677c |
|
14-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix typo Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
14b46879 |
|
12-Dec-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4d693d08 |
|
12-Dec-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix-tree: update callback for changing leaf nodes Support handing __radix_tree_replace() a callback that gets invoked for all leaf nodes that change or get freed as a result of the slot replacement, to assist users tracking nodes with node->private_list. This prepares for putting page cache shadow entries into the radix tree root again and drastically simplifying the shadow tracking. Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f4b109c6 |
|
12-Dec-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix-tree: add entry deletion support to __radix_tree_replace() Page cache shadow entry handling will be a lot simpler when it can use a single generic replacement function for pages, shadow entries, and emptying slots. Make __radix_tree_replace() properly account insertions and deletions in node->count and garbage collect nodes as they become empty. Then re-implement radix_tree_delete() on top of it. Link: http://lkml.kernel.org/r/20161117193058.GC23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6d75f366 |
|
12-Dec-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix-tree: check accounting of existing slot replacement users The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f7942430 |
|
12-Dec-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix-tree: native accounting of exceptional entries The way the page cache is sneaking shadow entries of evicted pages into the radix tree past the node entry accounting and tracking them manually in the upper bits of node->count is fraught with problems. These shadow entries are marked in the tree as exceptional entries, which are a native concept to the radix tree. Maintain an explicit counter of exceptional entries in the radix tree node. Subsequent patches will switch shadow entry tracking over to that counter. DAX and shmem are the other users of exceptional entries. Since slot replacements that change the entry type from regular to exceptional must now be accounted, introduce a __radix_tree_replace() function that does replacement and accounting, and switch DAX and shmem over. The increase in radix tree node size is temporary. A followup patch switches the shadow tracking to this new scheme and we'll no longer need the upper bits in node->count and shrink that back to one byte. Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2b41226b |
|
09-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "radix tree test suite: fix compilation" This reverts commit 53855d10f4567a0577360b6448d52a863929775b. It shouldn't have come in yet - it depends on the changes in linux-next that will come in during the next merge window. As Matthew Wilcox says, the test suite is broken with the current state without the revert. Requested-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
53855d10 |
|
07-Dec-2016 |
Matthew Wilcox <willy@infradead.org> |
radix tree test suite: fix compilation Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test suite as it adds a call to cpuhp_setup_state_nocalls() which is not currently emulated in the test suite. Add it, and delete the emulation of the old CPU hotplug mechanism. Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d544abd5 |
|
03-Nov-2016 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
lib/radix-tree: Convert to hotplug state machine Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161103145021.28528-6-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
d3798ae8 |
|
04-Oct-2016 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: filemap: don't plant shadow entries without radix tree node When the underflow checks were added to workingset_node_shadow_dec(), they triggered immediately: kernel BUG at ./include/linux/swap.h:276! invalid opcode: 0000 [#1] SMP Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1 Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016 task: ffff8faa93ecd940 task.stack: ffff8faa7f478000 RIP: page_cache_tree_insert+0xf1/0x100 Call Trace: __add_to_page_cache_locked+0x12e/0x270 add_to_page_cache_lru+0x4e/0xe0 mpage_readpages+0x112/0x1d0 blkdev_readpages+0x1d/0x20 __do_page_cache_readahead+0x1ad/0x290 force_page_cache_readahead+0xaa/0x100 page_cache_sync_readahead+0x3f/0x50 generic_file_read_iter+0x5af/0x740 blkdev_read_iter+0x35/0x40 __vfs_read+0xe1/0x130 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x13/0x8f Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00 RIP page_cache_tree_insert+0xf1/0x100 This is a long-standing bug in the way shadow entries are accounted in the radix tree nodes. The shrinker needs to know when radix tree nodes contain only shadow entries, no pages, so node->count is split in half to count shadows in the upper bits and pages in the lower bits. Unfortunately, the radix tree implementation doesn't know of this and assumes all entries are in node->count. When there is a shadow entry directly in root->rnode and the tree is later extended, the radix tree implementation will copy that entry into the new node and and bump its node->count, i.e. increases the page count bits. Once the shadow gets removed and we subtract from the upper counter, node->count underflows and triggers the warning. Afterwards, without node->count reaching 0 again, the radix tree node is leaked. Limit shadow entries to when we have actual radix tree nodes and can count them properly. That means we lose the ability to detect refaults from files that had only the first page faulted in at eviction time. Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8d2c0d36 |
|
25-Sep-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
radix tree: fix sibling entry handling in radix_tree_descend() The fixes to the radix tree test suite show that the multi-order case is broken. The basic reason is that the radix tree code uses tagged pointers with the "internal" bit in the low bits, and calculating the pointer indices was supposed to mask off those bits. But gcc will notice that we then use the index to re-create the pointer, and will avoid doing the arithmetic and use the tagged pointer directly. This cleans the code up, using the existing is_sibling_entry() helper to validate the sibling pointer range (instead of open-coding it), and using entry_to_node() to mask off the low tag bit from the pointer. And once you do that, you might as well just use the now cleaned-up pointer directly. [ Side note: the multi-order code isn't actually ever used in the kernel right now, and the only reason I didn't just delete all that code is that Kirill Shutemov piped up and said: "Well, my ext4-with-huge-pages patchset[1] uses multi-order entries. It also converts shmem-with-huge-pages and hugetlb to them. I'm okay with converting it to other mechanism, but I need something. (I looked into Konstantin's RFC patchset[2]. It looks okay, but I don't feel myself qualified to review it as I don't know much about radix-tree internals.)" [1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com [2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ] Reported-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Cedric Blancher <cedric.blancher@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
05eb6e72 |
|
02-Aug-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
radix-tree: account nodes to memcg only if explicitly requested Radix trees may be used not only for storing page cache pages, so unconditionally accounting radix tree nodes to the current memory cgroup is bad: if a radix tree node is used for storing data shared among different cgroups we risk pinning dead memory cgroups forever. So let's only account radix tree nodes if it was explicitly requested by passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to account page cache entries, so mark mapping->page_tree so. Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup") Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c78c66d1 |
|
26-Jul-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
radix-tree: implement radix_tree_maybe_preload_order() The new helper is similar to radix_tree_maybe_preload(), but tries to preload number of nodes required to insert (1 << order) continuous naturally-aligned elements. This is required to push huge pages into pagecache. Link: http://lkml.kernel.org/r/1466021202-61880-24-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9e85d811 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: make radix_tree_descend() more useful Now that the shift amount is stored in the node, radix_tree_descend() can calculate offset itself from index, which removes several lines of code from each of the tree walkers. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d604c324 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: introduce radix_tree_replace_clear_tags() In addition to replacing the entry, we also clear all associated tags. This is really a one-off special for page_cache_tree_delete() which had far too much detailed knowledge about how the radix tree works. For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It can be used by radix_tree_delete_item() as well as radix_tree_replace_clear_tags(). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
89148aa4 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: tidy up __radix_tree_create() 1. Rename the existing variable 'slot' to 'child'. 2. Introduce a new variable called 'slot' which is the address of the slot we're dealing with. This lets us simplify the tree insertion, and removes the recalculation of 'slot' at the end of the function. 3. Using 'slot' in the sibling pointer insertion part makes the code more readable. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a8e4da25 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: tidy up range_tag_if_tagged Convert radix_tree_range_tag_if_tagged to name the nodes parent, node and child instead of node & slot. Use parent->offset instead of playing games with 'upindex'. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8c1244de |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: tidy up next_chunk Convert radix_tree_next_chunk to use 'child' instead of 'slot' as the name of the child node. Also use node_maxindex() where it makes sense. The 'rnode' variable was unnecessary; it doesn't overlap in usage with 'node', so we can just use 'node' the whole way through the function. Improve the testcase to start the walk from every index in the carefully constructed tree, and to accept any index within the range covered by the entry. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
af49a63e |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: change naming conventions in radix_tree_shrink Use the more standard 'node' and 'child' instead of 'to_free' and 'slot'. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b194d16c |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rename radix_tree_is_indirect_ptr() As with indirect_to_ptr(), ptr_to_indirect() and RADIX_TREE_INDIRECT_PTR, change radix_tree_is_indirect_ptr() to radix_tree_is_internal_node(). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4dd6c098 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rename indirect_to_ptr() to entry_to_node() Mirrors the earlier commit introducing node_to_entry(). Also change the type returned to be a struct radix_tree_node pointer. That lets us simplify a couple of places in the radix tree shrink & extend paths where we could convert an entry into a pointer, modify the node, then convert the pointer back into an entry. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a4db4dce |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rename ptr_to_indirect() to node_to_entry() ptr_to_indirect() was a bad name. What it really means is "Convert this pointer to a node into an entry suitable for storing in the radix tree". So node_to_entry() seemed like a better name. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
30ff46cc |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rename INDIRECT_PTR to INTERNAL_NODE The name RADIX_TREE_INDIRECT_PTR doesn't really match the meaning. RADIX_TREE_INTERNAL_NODE is a better name. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0891265 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: remove root->height The only remaining references to root->height were in extend and shrink, where it was updated. Now we can remove it entirely. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fb209019 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: remove a use of root->height from delete_node If radix_tree_shrink returns whether it managed to shrink, then __radix_tree_delete_node doesn't ned to query the tree to find out whether it did any work or not. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c12e51b0 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: replace node->height with node->shift node->shift represents the shift necessary for looking in the slots array at this level. It is equal to the old (node->height - 1) * RADIX_TREE_MAP_SHIFT. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0c7fa0a8 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: split node->path into offset and height Neither piece of information we're storing in node->path can be larger than 64, so store each in its own unsigned char instead of shifting and masking to store them both in an unsigned int. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2fcd9005 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: miscellaneous fixes Typos, whitespace, grammar, line length, using the correct types, etc. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6b053b8e |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: add copyright statements The multiorder support is a sufficiently large feature to be worth adding copyrigt lines for. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0796c583 |
|
20-May-2016 |
Ross Zwisler <zwisler@kernel.org> |
radix-tree: fix radix_tree_dump() for multi-order entries - Print which indices are covered by every leaf entry - Print sibling entries - Print the node pointer instead of the slot entry - Build by default in userspace, and make it accessible to the test-suite Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
070c5ac2 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix radix_tree_range_tag_if_tagged() for multiorder entries I had previously decided that tagging a single multiorder entry would count as tagging 2^order entries for the purposes of 'nr_to_tag'. I now believe that decision to be a mistake, and it should count as a single entry. That's more likely to be what callers expect. When walking back up the tree from a newly-tagged entry, the current code assumed we were starting from the lowest level of the tree; if we have a multiorder entry with an order at least RADIX_TREE_MAP_SHIFT in size then we need to shift the index by 'shift' before we start walking back up the tree, or we will end up not setting tags on higher entries, and then mistakenly thinking that entries below a certain point in the tree are not tagged. If the first index we examine is a sibling entry of a tagged multiorder entry, we were not tagging it. We need to examine the canonical entry, and the easiest way to do that is to use radix_tree_descend(). We then have to skip over sibling slots when looking for the next entry in the tree or we will end up walking back to the canonical entry. Add several tests for radix_tree_range_tag_if_tagged(). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0a2efc6c |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rewrite radix_tree_locate_item Use the new multi-order support functions to rewrite radix_tree_locate_item(). Modify the locate tests to test multiorder entries too. [hughd@google.com: radix_tree_locate_item() is often returning the wrong index] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1605012108490.1166@eggly.anvils Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8a14f4d8 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix radix_tree_create for sibling entries If the radix tree user attempted to insert a colliding entry with an existing multiorder entry, then radix_tree_create() could encounter a sibling entry when walking down the tree to look for a slot. Use radix_tree_descend() to fix the problem, and add a test-case to make sure the problem doesn't come back in future. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4589ba6d |
|
20-May-2016 |
Ross Zwisler <zwisler@kernel.org> |
radix-tree: rewrite radix_tree_tag_get Use the new multi-order support functions to rewrite radix_tree_tag_get() Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
00f47b58 |
|
20-May-2016 |
Ross Zwisler <zwisler@kernel.org> |
radix-tree: rewrite radix_tree_tag_clear Use the new multi-order support functions to rewrite radix_tree_tag_clear() Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fb969909 |
|
20-May-2016 |
Ross Zwisler <zwisler@kernel.org> |
radix-tree: rewrite radix_tree_tag_set Use the new multi-order support functions to rewrite radix_tree_tag_set() Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
21ef5339 |
|
20-May-2016 |
Ross Zwisler <zwisler@kernel.org> |
radix-tree: add support for multi-order iterating This enables the macros radix_tree_for_each_slot() and friends to be used with multi-order entries. The way that this works is that we treat all entries in a given slots[] array as a single chunk. If the index given to radix_tree_next_chunk() happens to point us to a sibling entry, we will back up iter->index so that it points to the canonical entry, and that will be the place where we start our iteration. As we're processing a chunk in radix_tree_next_slot(), we process canonical entries, skip over sibling entries, and restart the chunk lookup if we find a non-sibling indirect pointer. This drops back to the radix_tree_next_chunk() code, which will re-walk the tree and look for another chunk. This allows us to properly handle multi-order entries mixed with other entries that are at various heights in the radix tree. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7b60e9ad |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix multiorder BUG_ON in radix_tree_insert These BUG_ON tests are to ensure that all the tags are clear when inserting a new entry. If we insert a multiorder entry, we'll end up looking at the tags for a different node, and so the BUG_ON can end up triggering spuriously. Also, we now have three tags, not two, so check all three are clear, and check all the root tags with a single call to BUG_ON since the bits are stored contiguously. Include a test-case to ensure this problem does not reoccur. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
85829954 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: rewrite __radix_tree_lookup Use the new multi-order support functions to rewrite __radix_tree_lookup() Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
afe0e395 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix several shrinking bugs with multiorder entries Setting the indirect bit on the user data entry used to be unambiguous because the tree walking code knew not to expect internal nodes in the last level of the tree. Multiorder entries can appear at any level of the tree, and a leaf with the indirect bit set is indistinguishable from a pointer to a node. Introduce a special entry (RADIX_TREE_RETRY) which is neither a valid user entry, nor a valid pointer to a node. The radix_tree_deref_retry() function continues to work the same way, but tree walking code can distinguish it from a pointer to a node. Also fix the condition for setting slot->parent to NULL; it does not matter what height the tree is, it only matters whether slot is an indirect pointer. Move this code above the comment which is referring to the assignment to root->rnode. Also fix the condition for preventing the tree from shrinking to a single entry if it's a multiorder entry. Add a test-case to the test suite that checks that the tree goes back down to its original height after an item is inserted & deleted from a higher index in the tree. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
49ea6ebc |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix extending the tree for multi-order entries at offset 0 The current code will insert entries at each level, as if we're going to add a new entry at the bottom level, so we then get an -EEXIST when we try to insert the entry into the tree. The best way to fix this is to not check 'order' when inserting into an empty tree. We still need to 'extend' the tree to the height necessary for the maximum index corresponding to this entry, so pass that value to radix_tree_extend() rather than the index we're asked to create, or we won't create a tree that's deep enough. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1456a439 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: introduce radix_tree_load_root() All the tree walking functions start with some variant of this code; centralise it in one place so we're not chasing subtly different bugs everywhere. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aa547576 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: remove restriction on multi-order entries Now that sibling pointers are handled explicitly, there is no purpose served by restricting the order to be >= RADIX_TREE_MAP_SHIFT. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
29e0967c |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix deleting a multi-order entry through an alias If we deleted an entry through an index which looked up a sibling pointer, we'd end up zeroing out the wrong slots in the node. Use get_slot_offset() to find the right slot. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3b8c00f6 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix sibling entry insertion The subtraction was the wrong way round, leading to undefined behaviour (shift by an amount larger than the size of the type). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
db050f29 |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: add missing sibling entry functionality The code I previously added to enable multiorder radix tree entries was untested and therefore buggy. This commit adds the support functions that Ross and I decided were necessary over a four-week period of iterating various designs. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
57578c2e |
|
20-May-2016 |
Matthew Wilcox <willy@infradead.org> |
raxix-tree: introduce CONFIG_RADIX_TREE_MULTIORDER I've been receiving increasingly concerned notes from 0day about how much my recent changes have been bloating the radix tree. Make it happier by only including multiorder support if CONFIG_TRANSPARENT_HUGEPAGES is set. This is an independent Kconfig option, so other radix tree users can also set it if they have a need. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7cf19af4 |
|
17-Mar-2016 |
Matthew Wilcox <willy@infradead.org> |
radix_tree: add radix_tree_dump This is debug code which is #if 0 out. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e6145236 |
|
17-Mar-2016 |
Matthew Wilcox <willy@infradead.org> |
radix_tree: add support for multi-order entries With huge pages, it is convenient to have the radix tree be able to return an entry that covers multiple indices. Previous attempts to deal with the problem have involved inserting N duplicate entries, which is a waste of memory and leads to problems trying to handle aliased tags, or probing the tree multiple times to find alternative entries which might cover the requested index. This approach inserts one canonical entry into the tree for a given range of indices, and may also insert other entries in order to ensure that lookups find the canonical entry. This solution only tolerates inserting powers of two that are greater than the fanout of the tree. If we wish to expand the radix tree's abilities to support large-ish pages that is less than the fanout at the penultimate level of the tree, then we would need to add one more step in lookup to ensure that any sibling nodes in the final level of the tree are dereferenced and we return the canonical entry that they reference. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0070e28d |
|
17-Mar-2016 |
Matthew Wilcox <willy@infradead.org> |
radix_tree: loop based on shift count, not height When we introduce entries that can cover multiple indices, we will need to stop in __radix_tree_create based on the shift, not the height. Split out for ease of bisect. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
339e6353 |
|
17-Mar-2016 |
Matthew Wilcox <willy@infradead.org> |
radix_tree: tag all internal tree nodes as indirect pointers Set the 'indirect_ptr' bit on all the pointers to internal nodes, not just on the root node. This enables the following patches to support multi-order entries in the radix tree. This patch is split out for ease of bisection. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
58e698af |
|
17-Mar-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
radix-tree: account radix_tree_node to memory cgroup Allocation of radix_tree_node objects can be easily triggered from userspace, so we should account them to memory cgroup. Besides, we need them accounted for making shadow node shrinker per memcg (see mm/workingset.c). A tricky thing about accounting radix_tree_node objects is that they are mostly allocated through radix_tree_preload(), so we can't just set SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a lot of unrelated cgroups using objects from each other's caches. One way to overcome this would be making radix tree preloads per memcg, but that would probably look cumbersome and overcomplicated. Instead, we make radix_tree_node_alloc() first try to allocate from the cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not, and only if it fails fall back on using per cpu preloads. This should make most allocations accounted. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
46437f9a |
|
02-Feb-2016 |
Matthew Wilcox <willy@infradead.org> |
radix-tree: fix race in gang lookup If the indirect_ptr bit is set on a slot, that indicates we need to redo the lookup. Introduce a new function radix_tree_iter_retry() which forces the loop to retry the lookup by setting 'slot' to NULL and turning the iterator back to point at the problematic entry. This is a pretty rare problem to hit at the moment; the lookup has to race with a grow of the radix tree from a height of 0. The consequences of hitting this race are that gang lookup could return a pointer to a radix_tree_node instead of a pointer to whatever the user had inserted in the tree. Fixes: cebbd29e1c2f ("radix-tree: rewrite gang lookup using iterator") Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d0164adc |
|
06-Nov-2015 |
Mel Gorman <mgorman@techsingularity.net> |
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9d2a8da0 |
|
25-Jun-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
radix-tree: replace preallocated node array with linked list Currently we use per-cpu array to hold pointers to preallocated nodes. Let's replace it with linked list. On x86_64 it saves 256 bytes in per-cpu ELF section which may translate into freeing up 2MB of memory for NR_CPUS==8192. [akpm@linux-foundation.org: fix comment, coding style] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
92cf2118 |
|
12-May-2015 |
Frederic Weisbecker <fweisbec@gmail.com> |
sched/preempt: Merge preempt_mask.h into preempt.h preempt_mask.h defines all the preempt_count semantics and related symbols: preempt, softirq, hardirq, nmi, preempt active, need resched, etc... preempt.h defines the accessors and mutators of preempt_count. But there is a messy dependency game around those two header files: * preempt_mask.h includes preempt.h in order to access preempt_count() * preempt_mask.h defines all preempt_count semantic and symbols except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h Thus we need to define it from preempt.h, right before including asm/preempt.h, instead of defining it to preempt_mask.h with the other preempt_count symbols. Therefore the preempt_count semantics happen to be spread out. * We plan to introduce preempt_active_[enter,exit]() to consolidate preempt_schedule*() code. But we'll need to access both preempt_count mutators (preempt_count_add()) and preempt_count symbols (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt operations is in preempt.h but then we'll need symbols in preempt_mask.h which already includes preempt.h. So we end up with a ressource circle dependency. Lets merge preempt_mask.h into preempt.h to solve these dependency issues. This way we gather semantic symbols and operation definition of preempt_count in a single file. This is a dumb copy-paste merge. Further merge re-arrangments are performed in a subsequent patch to ease review. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
886d3dfa |
|
12-Feb-2015 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
lib/radix-tree.c: change to simpler include The comment helpfully explains why hardirq.h is included, but since commit 2d4b84739f0a ("hardirq: Split preempt count mask definitions") in_interrupt() has been provided by preempt_mask.h. Use that instead, saving around 40 lines in the generated dependency file. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce80b067 |
|
06-Jun-2014 |
Catalin Marinas <catalin.marinas@arm.com> |
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations Since radix_tree_preload() stack trace is not always useful for debugging an actual radix tree memory leak, this patch updates the kmemleak allocation stack trace in the radix_tree_node_alloc() function. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e4c0b68 |
|
04-Jun-2014 |
Fabian Frederick <fabf@skynet.be> |
lib/radix-tree.c: kernel-doc warning fix index has been removed from __radix_tree_delete_node in 449dd6984d0e47 ("mm: keep page cache radix tree nodes in check") Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7c8e0181 |
|
04-Jun-2014 |
Christoph Lameter <cl@linux.com> |
mm: replace __get_cpu_var uses with this_cpu_ptr Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
449dd698 |
|
03-Apr-2014 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
139e5616 |
|
03-Apr-2014 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix_tree: tree node interface Make struct radix_tree_node part of the public interface and provide API functions to create, look up, and delete whole nodes. Refactor the existing insert, look up, delete functions on top of these new node primitives. This will allow the VM to track and garbage collect page cache radix tree nodes. [sasha.levin@oracle.com: return correct error code on insertion failure] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e7b563bb |
|
03-Apr-2014 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: filemap: move radix tree hole searching here The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
53c59f26 |
|
03-Apr-2014 |
Johannes Weiner <hannes@cmpxchg.org> |
lib: radix-tree: add radix_tree_delete_item() Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5f30fc94 |
|
03-Mar-2014 |
Hugh Dickins <hughd@google.com> |
lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlock Running fsx on tmpfs with concurrent memhog-swapoff-swapon, lots of BUG: sleeping function called from invalid context at kernel/fork.c:606 in_atomic(): 0, irqs_disabled(): 0, pid: 1394, name: swapoff 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 followed by ================================================ [ BUG: lock held when returning to user space! ] 3.14.0-rc1 #3 Not tainted ------------------------------------------------ swapoff/1394 is leaving the kernel with locks still held! 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 after which the system recovered nicely. Whoops, I long ago forgot the rcu_read_unlock() on one unlikely branch. Fixes e504f3fdd63d ("tmpfs radix_tree: locate_item to speed up swapoff") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5e4c0d97 |
|
11-Sep-2013 |
Jan Kara <jack@suse.cz> |
lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fffaee36 |
|
05-Jun-2012 |
Konstantin Khlebnikov <khlebnikov@openvz.org> |
radix-tree: fix contiguous iterator This patch fixes bug in macro radix_tree_for_each_contig(). If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following radix_tree_next_chunk() switches iterating into next chunk. As result iterating becomes non-contiguous and breaks vfs "splice" and all its users. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-and-bisected-by: Hans de Bruin <jmdebruin@xmsnet.nl> Reported-and-bisected-by: Ondrej Zary <linux@rainbow-software.org> Reported-bisected-and-tested-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lkml.org/lkml/2012/6/5/64 Cc: stable <stable@vger.kernel.org> # 3.4.x Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
55368052 |
|
29-May-2012 |
Nick Piggin <npiggin@kernel.dk> |
radix-tree: fix preload vector size We are not preallocating a sufficient number of nodes. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cebbd29e |
|
28-Mar-2012 |
Konstantin Khlebnikov <khlebnikov@openvz.org> |
radix-tree: rewrite gang lookup using iterator Rewrite radix_tree_gang_lookup_* functions using the new radix-tree iterator. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
78c1d784 |
|
28-Mar-2012 |
Konstantin Khlebnikov <khlebnikov@openvz.org> |
radix-tree: introduce bit-optimized iterator A series of radix tree cleanups, and usage of them in the core pagecache code. Micro-benchmark: lookup 14 slots (typical page-vector size) in radix-tree there earch <step> slot filled and tagged before/after - nsec per full scan through tree * Intel Sandy Bridge i7-2620M 4Mb L3 New code always faster * AMD Athlon 6000+ 2x1Mb L2, without L3 New code generally faster, Minor degradation (marked with "*") for huge sparse trees * i386 on Sandy Bridge New code faster for common cases: tagged and dense trees. Some degradations for non-tagged lookup on sparse trees. Ideally, there might help __ffs() analog for searching first non-zero long element in array, gcc sometimes cannot optimize this loop corretly. Numbers: CPU: Intel Sandy Bridge i7-2620M 4Mb L3 radix-tree with 1024 slots: tagged lookup step 1 before 7156 after 3613 step 2 before 5399 after 2696 step 3 before 4779 after 1928 step 4 before 4456 after 1429 step 5 before 4292 after 1213 step 6 before 4183 after 1052 step 7 before 4157 after 951 step 8 before 4016 after 812 step 9 before 3952 after 851 step 10 before 3937 after 732 step 11 before 4023 after 709 step 12 before 3872 after 657 step 13 before 3892 after 633 step 14 before 3720 after 591 step 15 before 3879 after 578 step 16 before 3561 after 513 normal lookup step 1 before 4266 after 3301 step 2 before 2695 after 2129 step 3 before 2083 after 1712 step 4 before 1801 after 1534 step 5 before 1628 after 1313 step 6 before 1551 after 1263 step 7 before 1475 after 1185 step 8 before 1432 after 1167 step 9 before 1373 after 1092 step 10 before 1339 after 1134 step 11 before 1292 after 1056 step 12 before 1319 after 1030 step 13 before 1276 after 1004 step 14 before 1256 after 987 step 15 before 1228 after 992 step 16 before 1247 after 999 radix-tree with 1024*1024*128 slots: tagged lookup step 1 before 1086102841 after 674196409 step 2 before 816839155 after 498138306 step 7 before 599728907 after 240676762 step 15 before 555729253 after 185219677 step 63 before 606637748 after 128585664 step 64 before 608384432 after 102945089 step 65 before 596987114 after 123996019 step 128 before 304459225 after 56783056 step 256 before 158846855 after 31232481 step 512 before 86085652 after 18950595 step 12345 before 6517189 after 1674057 normal lookup step 1 before 626064869 after 544418266 step 2 before 418809975 after 336321473 step 7 before 242303598 after 207755560 step 15 before 208380563 after 176496355 step 63 before 186854206 after 167283638 step 64 before 176188060 after 170143976 step 65 before 185139608 after 167487116 step 128 before 88181865 after 86913490 step 256 before 45733628 after 45143534 step 512 before 24506038 after 23859036 step 12345 before 2177425 after 2018662 * AMD Athlon 6000+ 2x1Mb L2, without L3 radix-tree with 1024 slots: tag-lookup step 1 before 8164 after 5379 step 2 before 5818 after 5581 step 3 before 4959 after 4213 step 4 before 4371 after 3386 step 5 before 4204 after 2997 step 6 before 4950 after 2744 step 7 before 4598 after 2480 step 8 before 4251 after 2288 step 9 before 4262 after 2243 step 10 before 4175 after 2131 step 11 before 3999 after 2024 step 12 before 3979 after 1994 step 13 before 3842 after 1929 step 14 before 3750 after 1810 step 15 before 3735 after 1810 step 16 before 3532 after 1660 normal-lookup step 1 before 7875 after 5847 step 2 before 4808 after 4071 step 3 before 4073 after 3462 step 4 before 3677 after 3074 step 5 before 4308 after 2978 step 6 before 3911 after 3807 step 7 before 3635 after 3522 step 8 before 3313 after 3202 step 9 before 3280 after 3257 step 10 before 3166 after 3083 step 11 before 3066 after 3026 step 12 before 2985 after 2982 step 13 before 2925 after 2924 step 14 before 2834 after 2808 step 15 before 2805 after 2803 step 16 before 2647 after 2622 radix-tree with 1024*1024*128 slots: tag-lookup step 1 before 1288059720 after 951736580 step 2 before 961292300 after 884212140 step 7 before 768905140 after 547267580 step 15 before 771319480 after 456550640 step 63 before 504847640 after 242704304 step 64 before 392484800 after 177920786 step 65 before 491162160 after 246895264 step 128 before 208084064 after 97348392 step 256 before 112401035 after 51408126 step 512 before 75825834 after 29145070 step 12345 before 5603166 after 2847330 normal-lookup step 1 before 1025677120 after 861375100 step 2 before 647220080 after 572258540 step 7 before 505518960 after 484041813 step 15 before 430483053 after 444815320 * step 63 before 388113453 after 404250546 * step 64 before 374154666 after 396027440 * step 65 before 381423973 after 396704853 * step 128 before 190078700 after 202619384 * step 256 before 100886756 after 102829108 * step 512 before 64074505 after 56158720 step 12345 before 4237289 after 4422299 * * i686 on Sandy bridge radix-tree with 1024 slots: tagged lookup step 1 before 7990 after 4019 step 2 before 5698 after 2897 step 3 before 5013 after 2475 step 4 before 4630 after 1721 step 5 before 4346 after 1759 step 6 before 4299 after 1556 step 7 before 4098 after 1513 step 8 before 4115 after 1222 step 9 before 3983 after 1390 step 10 before 4077 after 1207 step 11 before 3921 after 1231 step 12 before 3894 after 1116 step 13 before 3840 after 1147 step 14 before 3799 after 1090 step 15 before 3797 after 1059 step 16 before 3783 after 745 normal lookup step 1 before 5103 after 3499 step 2 before 3299 after 2550 step 3 before 2489 after 2370 step 4 before 2034 after 2302 * step 5 before 1846 after 2268 * step 6 before 1752 after 2249 * step 7 before 1679 after 2164 * step 8 before 1627 after 2153 * step 9 before 1542 after 2095 * step 10 before 1479 after 2109 * step 11 before 1469 after 2009 * step 12 before 1445 after 2039 * step 13 before 1411 after 2013 * step 14 before 1374 after 2046 * step 15 before 1340 after 1975 * step 16 before 1331 after 2000 * radix-tree with 1024*1024*128 slots: tagged lookup step 1 before 1225865377 after 667153553 step 2 before 842427423 after 471533007 step 7 before 609296153 after 276260116 step 15 before 544232060 after 226859105 step 63 before 519209199 after 141343043 step 64 before 588980279 after 141951339 step 65 before 521099710 after 138282060 step 128 before 298476778 after 83390628 step 256 before 149358342 after 43602609 step 512 before 76994713 after 22911077 step 12345 before 5328666 after 1472111 normal lookup step 1 before 819284564 after 533635310 step 2 before 512421605 after 364956155 step 7 before 271443305 after 305721345 * step 15 before 223591630 after 273960216 * step 63 before 190320247 after 217770207 * step 64 before 178538168 after 267411372 * step 65 before 186400423 after 215347937 * step 128 before 88106045 after 140540612 * step 256 before 44812420 after 70660377 * step 512 before 24435438 after 36328275 * step 12345 before 2123924 after 2148062 * bloat-o-meter delta for this patchset + patchset with related shmem cleanups bloat-o-meter: x86_64 add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11) function old new delta radix_tree_next_chunk - 499 +499 shmem_unuse 428 554 +126 shmem_radix_tree_replace 131 227 +96 find_get_pages_tag 354 419 +65 find_get_pages_contig 345 407 +62 find_get_pages 362 396 +34 __kstrtab_radix_tree_next_chunk - 22 +22 __ksymtab_radix_tree_next_chunk - 16 +16 __kcrctab_radix_tree_next_chunk - 8 +8 radix_tree_gang_lookup_slot 204 203 -1 static.shmem_xattr_set 384 381 -3 radix_tree_gang_lookup_tag_slot 208 191 -17 radix_tree_gang_lookup 231 187 -44 radix_tree_gang_lookup_tag 247 199 -48 shmem_unlock_mapping 278 190 -88 __lookup 217 - -217 __lookup_tag 242 - -242 radix_tree_locate_item 279 - -279 bloat-o-meter: i386 add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200) function old new delta radix_tree_next_chunk - 757 +757 shmem_unuse 352 449 +97 find_get_pages_contig 269 322 +53 shmem_radix_tree_replace 113 154 +41 find_get_pages_tag 277 318 +41 dcache_dir_lseek 426 458 +32 __kstrtab_radix_tree_next_chunk - 22 +22 vc_do_resize 968 977 +9 snd_pcm_lib_read1 725 733 +8 __ksymtab_radix_tree_next_chunk - 8 +8 netlbl_cipsov4_list 1120 1127 +7 find_get_pages 293 291 -2 new_slab 467 459 -8 bitfill_unaligned_rev 425 417 -8 radix_tree_gang_lookup_tag_slot 177 146 -31 blk_dump_cmd 267 229 -38 radix_tree_gang_lookup_slot 212 134 -78 shmem_unlock_mapping 221 128 -93 radix_tree_gang_lookup_tag 275 162 -113 radix_tree_gang_lookup 255 126 -129 __lookup 227 - -227 __lookup_tag 271 - -271 radix_tree_locate_item 277 - -277 This patch: Implement a clean, simple and effective radix-tree iteration routine. Iterating divided into two phases: * lookup next chunk in radix-tree leaf node * iterating through slots in this chunk Main iterator function radix_tree_next_chunk() returns pointer to first slot, and stores in the struct radix_tree_iter index of next-to-last slot. For tagged-iterating it also constuct bitmask of tags for retunted chunk. All additional logic implemented as static-inline functions and macroses. Also adds radix_tree_find_next_bit() static-inline variant of find_next_bit() optimized for small constant size arrays, because find_next_bit() too heavy for searching in an array with one/two long elements. [akpm@linux-foundation.org: rework comments a bit] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8bc3bcc9 |
|
16-Nov-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
lib: reduce the use of module.h wherever possible For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
e2bdb933 |
|
12-Jan-2012 |
Hugh Dickins <hughd@google.com> |
radix_tree: take radix_tree_path off stack Down, down in the deepest depths of GFP_NOIO page reclaim, we have shrink_page_list() calling __remove_mapping() calling __delete_from_ swap_cache() or __delete_from_page_cache(). You would not expect those to need much stack, but in fact they call radix_tree_delete(): which declares a 192-byte radix_tree_path array on its stack (to record the node,offsets it visits when descending, in case it needs to ascend to update them). And if any tag is still set [1], that calls radix_tree_tag_clear(), which declares a further such 192-byte radix_tree_path array on the stack. (At least we have interrupts disabled here, so won't then be pushing registers too.) That was probably a good choice when most users were 32-bit (array of half the size), and adding fields to radix_tree_node would have bloated it unnecessarily. But nowadays many are 64-bit, and each radix_tree_node contains a struct rcu_head, which is only used when freeing; whereas the radix_tree_path info is only used for updating the tree (deleting, clearing tags or setting tags if tagged) when a lock must be held, of no interest when accessing the tree locklessly. So add a parent pointer to the radix_tree_node, in union with the rcu_head, and remove all uses of the radix_tree_path. There would be space in that union to save the offset when descending as before (we can argue that a lock must already be held to exclude other users), but recalculating it when ascending is both easy (a constant shift and a constant mask) and uncommon, so it seems better just to do that. Two little optimizations: no need to decrement height when descending, adjusting shift is enough; and once radix_tree_tag_if_tagged() has set tag on a node and its ancestors, it need not ascend from that node again. perf on the radix tree test harness reports radix_tree_insert() as 2% slower (now having to set parent), but radix_tree_delete() 24% faster. Surely that's an exaggeration from rtth's artificially low map shift 3, but forcing it back to 6 still rates radix_tree_delete() 8% faster. [1] Can a pagecache tag (dirty, writeback or towrite) actually still be set at the time of radix_tree_delete()? Perhaps not if the filesystem is well-behaved. But although I've not tracked any stack overflow down to this cause, I have observed a curious case in which a dirty tag is set and left set on tmpfs: page migration's migrate_page_copy() happens to use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a filesystem which doesn't use tags, except for this stack depth issue. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nai Xia <nai.xia@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3fa36acb |
|
31-Oct-2011 |
Hugh Dickins <hughd@google.com> |
radix_tree: clean away saw_unset_tag leftovers radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind. Remove it now, and return 0 as soon as we see unset tag - we already rely upon the root tag to be correct, returning 0 immediately if it's not set. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e504f3fd |
|
03-Aug-2011 |
Hugh Dickins <hughd@google.com> |
tmpfs radix_tree: locate_item to speed up swapoff We have already acknowledged that swapoff of a tmpfs file is slower than it was before conversion to the generic radix_tree: a little slower there will be acceptable, if the hotter paths are faster. But it was a shock to find swapoff of a 500MB file 20 times slower on my laptop, taking 10 minutes; and at that rate it significantly slows down my testing. Now, most of that turned out to be overhead from PROVE_LOCKING and PROVE_RCU: without those it was only 4 times slower than before; and more realistic tests on other machines don't fare as badly. I've tried a number of things to improve it, including tagging the swap entries, then doing lookup by tag: I'd expected that to halve the time, but in practice it's erratic, and often counter-productive. The only change I've so far found to make a consistent improvement, is to short-circuit the way we go back and forth, gang lookup packing entries into the array supplied, then shmem scanning that array for the target entry. Scanning in place doubles the speed, so it's now only twice as slow as before (or three times slower when the PROVEs are on). So, add radix_tree_locate_item() as an expedient, once-off, single-caller hack to do the lookup directly in place. #ifdef it on CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited applicability as save space in other configurations. And, sadly, #include sched.h for cond_resched(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6328650b |
|
03-Aug-2011 |
Hugh Dickins <hughd@google.com> |
radix_tree: exceptional entries and indices A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its peculiar swap vector, instead keeping a file's swap entries in the same radix tree as its struct page pointers: thus saving memory, and simplifying its code and locking. This patch: The radix_tree is used by several subsystems for different purposes. A major use is to store the struct page pointers of a file's pagecache for memory management. But what if mm wanted to store something other than page pointers there too? The low bit of a radix_tree entry is already used to denote an indirect pointer, for internal use, and the unlikely radix_tree_deref_retry() case. Define the next bit as denoting an exceptional entry, and supply inline functions radix_tree_exception() to return non-0 in either unlikely case, and radix_tree_exceptional_entry() to return non-0 in the second case. If a subsystem already uses radix_tree with that bit set, no problem: it does not affect internal workings at all, but is defined for the convenience of those storing well-aligned pointers in the radix_tree. The radix_tree_gang_lookups have an implicit assumption that the caller can deduce the offset of each entry returned e.g. by the page->index of a struct page. But that may not be feasible for some kinds of item to be stored there. radix_tree_gang_lookup_slot() allow for an optional indices argument, output array in which to return those offsets. The same could be added to other radix_tree_gang_lookups, but for now keep it to the only one for which we need it. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ac15ee69 |
|
25-Jan-2011 |
Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> |
radix_tree: radix_tree_gang_lookup_tag_slot() may never return Executed command: fsstress -d /mnt -n 600 -p 850 crash> bt PID: 7947 TASK: ffff880160546a70 CPU: 0 COMMAND: "fsstress" #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9 #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952 #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8 #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969 #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514 #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60 [exception RIP: __lookup_tag+100] RIP: ffffffff812274b4 RSP: ffff88016056b998 RFLAGS: 00000287 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000006 RDX: 000000000000001d RSI: ffff88016056bb18 RDI: ffff8800c85366e0 RBP: ffff88016056b9c8 R8: ffff88016056b9e8 R9: 0000000000000000 R10: 000000000000000e R11: ffff8800c8536908 R12: 0000000000000010 R13: 0000000000000040 R14: ffffffffffffffc0 R15: ffff8800c85366e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4 #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605 #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110 #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85 #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47 #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014 #13 [ffff88016056bbe0] do_writepages at ffffffff81105055 #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5 #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085 #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8 #21 [ffff88016056bf30] sys_write at ffffffff81150691 #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2 I think this root cause is the following: radix_tree_range_tag_if_tagged() always tags the root tag with settag if the root tag is set with iftag even if there are no iftag tags in the specified range (Of course, there are some iftag tags outside the specified range). =============================================================================== [[[Detailed description]]] (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever? __lookup_tag(): - Return with 0. - Return with the index which is not bigger than the old one as the input parameter. Therefore the following "while" repeats forever because the above conditions cause "ret" not to be updated and the cur_index cannot be changed into the bigger one. (So, radix_tree_gang_lookup_tag_slot() cannot return forever.) radix_tree_gang_lookup_tag_slot(): 1178 while (ret < max_items) { 1179 unsigned int slots_found; 1180 unsigned long next_index; /* Index of next search */ 1181 1182 if (cur_index > max_index) 1183 break; 1184 slots_found = __lookup_tag(node, results + ret, 1185 cur_index, max_items - ret, &next_index, tag); 1186 ret += slots_found; // cannot update ret because slots_found == 0. // so, this while loops forever. 1187 if (next_index == 0) 1188 break; 1189 cur_index = next_index; 1190 } (2) Why does __lookup_tag() return with 0 and doesn't update the index? Assuming the following: - the one of the slot in radix_tree_node is NULL. - the one of the tag which corresponds to the slot sets with PAGECACHE_TAG_TOWRITE or other. - In a certain height(!=0), the corresponding index is 0. a) __lookup_tag() notices that the tag is set. 1005 static unsigned int 1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index, 1007 unsigned int max_items, unsigned long *next_index, unsigned int tag) 1008 { 1009 unsigned int nr_found = 0; 1010 unsigned int shift, height; 1011 1012 height = slot->height; 1013 if (height == 0) 1014 goto out; 1015 shift = (height-1) * RADIX_TREE_MAP_SHIFT; 1016 1017 while (height > 0) { 1018 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ; 1019 1020 for (;;) { 1021 if (tag_get(slot, tag, i)) 1022 break; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * the index is not updated yet. b) __lookup_tag() notices that the slot is NULL. 1023 index &= ~((1UL << shift) - 1); 1024 index += 1UL << shift; 1025 if (index == 0) 1026 goto out; /* 32-bit wraparound */ 1027 i++; 1028 if (i == RADIX_TREE_MAP_SIZE) 1029 goto out; 1030 } 1031 height--; 1032 if (height == 0) { /* Bottom level: grab some items */ ... 1055 } 1056 shift -= RADIX_TREE_MAP_SHIFT; 1057 slot = rcu_dereference_raw(slot->slots[i]); 1058 if (slot == NULL) 1059 break; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ c) __lookup_tag() doesn't update the index and return with 0. 1060 } 1061 out: 1062 *next_index = index; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1063 return nr_found; 1064 } (3) Why is the slot NULL even if the tag is set? Because radix_tree_range_tag_if_tagged() always sets the root tag with PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY, even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE in the specified range (from *first_indexp to last_index). Of course, some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range. (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback()) 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, 641 unsigned long *first_indexp, unsigned long last_index, 642 unsigned long nr_to_tag, 643 unsigned int iftag, unsigned int settag) 644 { 645 unsigned int height = root->height; 646 struct radix_tree_path path[height]; 647 struct radix_tree_path *pathp = path; 648 struct radix_tree_node *slot; 649 unsigned int shift; 650 unsigned long tagged = 0; 651 unsigned long index = *first_indexp; 652 653 last_index = min(last_index, radix_tree_maxindex(height)); 654 if (index > last_index) 655 return 0; 656 if (!nr_to_tag) 657 return 0; 658 if (!root_tag_get(root, iftag)) { 659 *first_indexp = last_index + 1; 660 return 0; 661 } 662 if (height == 0) { 663 *first_indexp = last_index + 1; 664 root_tag_set(root, settag); 665 return 1; 666 } ... 733 root_tag_set(root, settag); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 734 *first_indexp = index; 735 736 return tagged; 737 } As the result, there is no radix_tree_node which is set with PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with PAGECACHE_TAG_TOWRITE. [figure: inside radix_tree] (Please see the figure with typewriter font) =========================================== [roottag = DIRTY] | tag=0:NOTHING tag[0 0 0 1] 1:DIRTY [x x x +] 2:WRITEBACK | 3:DIRTY,WRITEBACK p 4:TOWRITE <---> 5:DIRTY,TOWRITE ... specified range (index: 0 to 2) * There is no DIRTY tag within the specified range. (But there is a DIRTY tag outside that range.) | | | | | | | | | after calling tag_pages_for_writeback() | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | p is "page". tag[0 0 0 1] x is NULL. [x x x +] +- is a pointer to "page". | p * But TOWRITE tag is set on the root tag. ============================================ After that, radix_tree_extend() via radix_tree_insert() is called when the page is added. This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE to succeed the status of the root tag. 246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long index) 247 { 248 struct radix_tree_node *node; 249 unsigned int height; 250 int tag; 251 252 /* Figure out what the height should be. */ 253 height = root->height + 1; 254 while (index > radix_tree_maxindex(height)) 255 height++; 256 257 if (root->rnode == NULL) { 258 root->height = height; 259 goto out; 260 } 261 262 do { 263 unsigned int newheight; 264 if (!(node = radix_tree_node_alloc(root))) 265 return -ENOMEM; 266 267 /* Increase the height. */ 268 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode); 269 270 /* Propagate the aggregated tag info into the new root */ 271 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { 272 if (root_tag_get(root, tag)) 273 tag_set(node, tag, 0); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 274 } =========================================== [roottag = DIRTY,TOWRITE] | : tag[0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p (new page) | | | | | | | | | after calling radix_tree_insert | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | tag [5 0 0 0] * DIRTY and TOWRITE tags are [+ + x x] succeeded to the new node. | | tag [0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p ============================================ After that, the index 3 page is released by remove_from_page_cache(). Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE and that the slot which corresponds to the tag is NULL. =========================================== [roottag = DIRTY,TOWRITE] | tag [5 0 0 0] [+ + x x] | | tag [0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p (remove) | | | | | | | | | after calling remove_page_cache | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | tag [4 0 0 0] * Only DIRTY tag is cleared [x + x x] because no TOWRITE tag is existed | in the bottom node. [0 0 0 0] [+ x x x] | p ============================================ To solve this problem Change to that radix_tree_tag_if_tagged() doesn't tag the root tag if it doesn't set any tags within the specified range. Like this. ============================================ 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, 641 unsigned long *first_indexp, unsigned long last_index, 642 unsigned long nr_to_tag, 643 unsigned int iftag, unsigned int settag) 644 { 650 unsigned long tagged = 0; ... 733 if (tagged) ^^^^^^^^^^^^^^^^^^^^^^^^ 734 root_tag_set(root, settag); 735 *first_indexp = index; 736 737 return tagged; 738 } ============================================ Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
27d20fdd |
|
11-Nov-2010 |
Nick Piggin <npiggin@kernel.dk> |
radix-tree: fix RCU bug Salman Qazi describes the following radix-tree bug: In the following case, we get can get a deadlock: 0. The radix tree contains two items, one has the index 0. 1. The reader (in this case find_get_pages) takes the rcu_read_lock. 2. The reader acquires slot(s) for item(s) including the index 0 item. 3. The non-zero index item is deleted, and as a consequence the other item is moved to the root of the tree. The place where it used to be is queued for deletion after the readers finish. 3b. The zero item is deleted, removing it from the direct slot, it remains in the rcu-delayed indirect node. 4. The reader looks at the index 0 slot, and finds that the page has 0 ref count 5. The reader looks at it again, hoping that the item will either be freed or the ref count will increase. This never happens, as the slot it is looking at will never be updated. Also, this slot can never be reclaimed because the reader is holding rcu_read_lock and is in an infinite loop. The fix is to re-use the same "indirect" pointer case that requires a slot lookup retry into a general "retry the lookup" bit. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reported-by: Salman Qazi <sqazi@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
144dcfc0 |
|
22-Aug-2010 |
Dave Chinner <dchinner@redhat.com> |
radix-tree: radix_tree_range_tag_if_tagged() can set incorrect tags Commit ebf8aa44beed48cd17893a83d92a4403e5f9d9e2 ("radix-tree: omplement function radix_tree_range_tag_if_tagged") does not safely set tags on on intermediate tree nodes. The code walks down the tree setting tags before it has fully resolved the path to the leaf under the assumption there will be a leaf slot with the tag set in the range it is searching. Unfortunately, this is not a valid assumption - we can abort after setting a tag on an intermediate node if we overrun the number of tags we are allowed to set in a batch, or stop scanning because we we have passed the last scan index before we reach a leaf slot with the tag we are searching for set. As a result, we can leave the function with tags set on intemediate nodes which can be tripped over later by tag-based lookups. The result of these stale tags is that lookup may end prematurely or livelock because the lookup cannot make progress. The fix for the problem involves reocrding the traversal path we take to the leaf nodes, and only propagating the tags back up the tree once the tag is set in the leaf node slot. We are already recording the path for efficient traversal, so there is no additional overhead to do the intermediately node tag setting in this manner. This fixes a radix tree lookup livelock triggered by the new writeback sync livelock avoidance code introduced in commit f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback livelock avoidance using page tagging"). Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Jan Kara <jack@suse.cz>
|
#
b6dd0865 |
|
22-Aug-2010 |
Dave Chinner <dchinner@redhat.com> |
radix-tree: clear all tags in radix_tree_node_rcu_free Commit f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback livelock avoidance using page tagging") introduced a new radix tree tag, increasing the number of tags in each node from 2 to 3. It did not, however, fix up the code in radix_tree_node_rcu_free() that cleans up after radix_tree_shrink() and hence could leave stray tags set in the new tag array. The result is that the livelock avoidance code added in the the above commit would hit stale tags when doing tag based lookups, resulting in livelocks when trying to traverse the tree. Fix this problem in radix_tree_node_rcu_free() so it doesn't happen again in the future by using a loop to walk all the tags up to RADIX_TREE_MAX_TAGS to clear the stray tags radix_tree_shrink() leaves behind. Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Nick Piggin <npiggin@kernel.dk> Acked-by: Jan Kara <jack@suse.cz>
|
#
d5ed3a4a |
|
19-Aug-2010 |
Jan Kara <jack@suse.cz> |
lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged() When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows index and tree traversal code goes astray reading memory until it hits unreadable memory. Check for overflow and exit in that case. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a1115570 |
|
25-Feb-2010 |
Arnd Bergmann <arnd@arndb.de> |
radix-tree: __rcu annotations Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
|
#
ebf8aa44 |
|
09-Aug-2010 |
Jan Kara <jack@suse.cz> |
radix-tree: omplement function radix_tree_range_tag_if_tagged Implement function for setting one tag if another tag is set for each item in given range. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Chris Mason <chris.mason@oracle.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
edcd1d84 |
|
26-May-2010 |
Cesar Eduardo Barros <cesarb@cesarb.net> |
radix-tree: fix radix_tree_prev_hole() underflow case radix_tree_prev_hole() used LONG_MAX to detect underflow; however, ULONG_MAX is clearly what was intended, both here and by its only user (count_history_pages at mm/readahead.c). Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ce82653d |
|
06-Apr-2010 |
David Howells <dhowells@redhat.com> |
radix_tree_tag_get() is not as safe as the docs make out [ver #2] radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set() or radix_tree_tag_clear(). The problem is that the double tag_get() in radix_tree_tag_get(): if (!tag_get(node, tag, offset)) saw_unset_tag = 1; if (height == 1) { int ret = tag_get(node, tag, offset); may see the value change due to the action of set/clear. RCU is no protection against this as no pointers are being changed, no nodes are being replaced according to a COW protocol - set/clear alter the node directly. The documentation in linux/radix-tree.h, however, says that radix_tree_tag_get() is an exception to the rule that "any function modifying the tree or tags (...) must exclude other modifications, and exclude any functions reading the tree". The problem is that the next statement in radix_tree_tag_get() checks that the tag doesn't vary over time: BUG_ON(ret && saw_unset_tag); This has been seen happening in FS-Cache: https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various comments that the value of the tag may change whilst the RCU read lock is held, and thus that the return value of radix_tree_tag_get() may not be relied upon unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from running concurrently with it. Reported-by: Romain DEGEZ <romain.degez@smartjog.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
2676a58c |
|
22-Feb-2010 |
Paul E. McKenney <paulmck@kernel.org> |
radix-tree: Disable RCU lockdep checking in radix tree Because the radix tree is used with many different locking designs, we cannot do any effective checking without changing the radix-tree APIs. It might make sense to do this later, but only if the RCU lockdep checking proves itself sufficiently valuable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-10-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
285e728b |
|
19-Nov-2009 |
David Howells <dhowells@redhat.com> |
FS-Cache: Don't delete pending pages from the page-store tracking tree Don't delete pending pages from the page-store tracking tree, but rather send them for another write as they've presumably been updated. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
b34df792 |
|
19-Nov-2009 |
David Howells <dhowells@redhat.com> |
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored __fscache_write_page() attempts to load the radix tree preallocation pool for the CPU it is on before calling radix_tree_insert(), as the insertion must be done inside a pair of spinlocks. Use of the preallocation pool, however, is contingent on the radix tree being initialised without __GFP_WAIT specified. __fscache_acquire_cookie() was passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT. The solution is to AND out __GFP_WAIT. Additionally, the banner comment to radix_tree_preload() is altered to make note of this prerequisite. Possibly there should be a WARN_ON() too. Without this fix, I have seen the following recursive deadlock caused by radix_tree_insert() attempting to allocate memory inside the spinlocked region, which resulted in FS-Cache being called back into to release memory - which required the spinlock already held. ============================================= [ INFO: possible recursive locking detected ] 2.6.32-rc6-cachefs #24 --------------------------------------------- nfsiod/7916 is trying to acquire lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] but task is already holding lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] other info that might help us debug this: 5 locks held by nfsiod/7916: #0: (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #1: (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #2: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] #3: (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache] #4: (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache] stack backtrace: Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24 Call Trace: [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3 [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3 [<ffffffff8100e27d>] ? dump_trace+0x248/0x257 [<ffffffff8105ad70>] lock_acquire+0x57/0x6d [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffff8135467c>] _spin_lock+0x2c/0x3b [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache] [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs] [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs] [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31 [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff810934ca>] shrink_list+0x8d/0x8f [<ffffffff81093744>] shrink_zone+0x278/0x33c [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392 [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212 [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1 [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118 [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache] [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs] [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs] [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs] [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc] [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc] [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc] [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2 [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2 [<ffffffff81352433>] ? thread_return+0x3e/0x101 [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc] [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34 [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2 [<ffffffff8104bd21>] kthread+0x7a/0x82 [<ffffffff8100beda>] child_rip+0xa/0x20 [<ffffffff8100b87c>] ? restore_args+0x0/0x30 [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44 [<ffffffff8104bca7>] ? kthread+0x0/0x82 [<ffffffff8100bed0>] ? child_rip+0x0/0x20 Signed-off-by: David Howells <dhowells@redhat.com>
|
#
b72b71c6 |
|
16-Jun-2009 |
Huang Shijie <shijie8@gmail.com> |
lib: do code optimization for radix_tree_lookup() and radix_tree_lookup_slot() radix_tree_lookup() and radix_tree_lookup_slot() have much the same code except for the return value. Introduce radix_tree_lookup_element() to do the real work. /* * is_slot == 1 : search for the slot. * is_slot == 0 : search for the node. */ static void * radix_tree_lookup_element(struct radix_tree_root *root, unsigned long index, int is_slot); Signed-off-by: Huang Shijie <shijie8@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dc566127 |
|
16-Jun-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
radix-tree: add radix_tree_prev_hole() The counterpart of radix_tree_next_hole(). To be used by context readahead. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8cef7d57 |
|
06-Jan-2009 |
Harvey Harrison <harvey.harrison@gmail.com> |
lib: radix_tree.c make percpu variable static radix_tree_preloads is unused outside of this file, make it static. Noticed by sparse: lib/radix-tree.c:84:1: warning: symbol 'per_cpu__radix_tree_preloads' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e6bdb7f |
|
27-Nov-2008 |
Wu Fengguang <fengguang.wu@intel.com> |
trivial: radix-tree: document wrap-around issue of radix_tree_next_hole() And some 80-line cleanups. Signed-off-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
51cc5068 |
|
25-Jul-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
47feff2c |
|
25-Jul-2008 |
Nick Piggin <npiggin@suse.de> |
radix-tree: add gang_lookup_slot, gang_lookup_slot_tag Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which are used by lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cde53535 |
|
04-Jul-2008 |
Christoph Lameter <clameter@sgi.com> |
Christoph has moved Remove all clameter@sgi.com addresses from the kernel tree since they will become invalid on June 27th. Change my maintainer email address for the slab allocators to cl@linux-foundation.org (which will be the new email address for the future). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
643b52b9 |
|
12-Jun-2008 |
Nick Piggin <nickpiggin@yahoo.com.au> |
radix-tree: fix small lockless radix-tree bug We shrink a radix tree when its root node has only one child, in the left most slot. The child becomes the new root node. To perform this operation in a manner compatible with concurrent lockless lookups, we atomically switch the root pointer from the parent to its child. However a concurrent lockless lookup may now have loaded a pointer to the parent (and is presently deciding what to do next). For this reason, we also have to keep the parent node in a valid state after shrinking the tree, until the next RCU grace period -- otherwise this lookup with the parent pointer may not do the right thing. Notably, we need to keep the child in the left most slot there in case that is requested by the lookup. This is all pretty standard RCU stuff. It is worth repeating because in my eagerness to obey the radix tree node constructor scheme, I had broken it by zeroing the radix tree node before the grace period. What could happen is that a lookup can load the parent pointer, then decide it wants to follow the left most child slot, only to find the slot contained NULL due to the concurrent shrinker having zeroed the parent node before waiting for a grace period. The lookup would return a false negative as a result. Fix it by doing that clearing in the RCU callback. I would normally want to rip out the constructor entirely, but radix tree nodes are one of those places where they make sense (only few cachelines will be touched soon after allocation). This was never actually found in any lockless pagecache testing or by the test harness, but by seeing the odd problem with my scalable vmap rewrite. I have not tickled the test harness into reproducing it yet, but I'll keep working at it. Fortunately, it is not a problem anywhere lockless pagecache is used in mainline kernels (pagecache probe is not a guarantee, and brd does not have concurrent lookups and deletes). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
488514d1 |
|
28-Apr-2008 |
Christoph Lameter <clameter@sgi.com> |
Remove set_migrateflags() Migrate flags must be set on slab creation as agreed upon when the antifrag logic was reviewed. Otherwise some slabs of a slabcache will end up in the unmovable and others in the reclaimable section depending on which flag was active when a new slab page was allocated. This likely slid in somehow when antifrag was merged. Remove it. The buffer_heads are always allocated with __GFP_RECLAIMABLE because the SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any effect there. Radix tree allocations are not directly reclaimable but they are allocated with __GFP_RECLAIMABLE set on each allocation. We now set SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix tree slabs are consistently placed in the reclaimable section. Radix tree slabs will also be accounted as such. There is then no user left of set_migratepages. So remove it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e2848a0e |
|
04-Feb-2008 |
Nick Piggin <npiggin@suse.de> |
radix-tree: avoid atomic allocations for preloaded insertions Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
430d275a |
|
17-Oct-2007 |
Peter Lund <firefly@vax64.dk> |
avoid negative (and full-width) shifts in radix-tree.c Negative shifts are not allowed in C (the result is undefined). Same thing with full-width shifts. It works on most platforms but not on the VAX with gcc 4.0.1 (it results in an "operand reserved" fault). Shifting by more than the width of the value on the left is also not allowed. I think the extra '>> 1' tacked on at the end in the original code was an attempt to work around that. Getting rid of that is an extra feature of this patch. Here's the chapter and verse, taken from the final draft of the C99 standard ("6.5.7 Bitwise shift operators", paragraph 3): "The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined." Thank you to Jan-Benedict Glaw, Christoph Hellwig, Maciej Rozycki, Pekka Enberg, Andreas Schwab, and Christoph Lameter for review. Special thanks to Andreas for spotting that my fix only removed half the undefined behaviour. Signed-off-by: Peter Lund <firefly@vax64.dk> Christoph Lameter <clameter@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andreas Schwab <schwab@suse.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4ba9b9d0 |
|
17-Oct-2007 |
Christoph Lameter <clameter@sgi.com> |
Slab API: remove useless ctor parameter and reorder parameters Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e12ba74d |
|
16-Oct-2007 |
Mel Gorman <mel@csn.ul.ie> |
Group short-lived and reclaimable kernel allocations This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
26fb1589 |
|
16-Oct-2007 |
Jeff Moyer <jmoyer@redhat.com> |
fix the max path calculation in radix-tree.c A while back, Nick Piggin introduced a patch to reduce the node memory usage for small files (commit cfd9b7df4abd3257c9e381b0e445817b26a51c0c): -#define RADIX_TREE_MAP_SHIFT 6 +#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) Unfortunately, he didn't take into account the fact that the calculation of the maximum path was based on an assumption of having to round up: #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2) So, if CONFIG_BASE_SMALL is set, you will end up with a RADIX_TREE_MAX_PATH that is one greater than necessary. The practical upshot of this is just a bit of wasted memory (one long in the height_to_maxindex array, an extra pre-allocated radix tree node per cpu, and extra stack usage in a couple of functions), but it seems worth getting right. It's also worth noting that I never build with CONFIG_BASE_SMALL. What I did to test this was duplicate the code in a small user-space program and check the results of the calculations for max path and the contents of the height_to_maxindex array. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c0bc9875 |
|
16-Oct-2007 |
Nick Piggin <npiggin@suse.de> |
radix-tree: use indirect bit Rather than sign direct radix-tree pointers with a special bit, sign the indirect one that hangs off the root. This means that, given a lookup_slot operation, the invalid result will be differentiated from the valid (previously, valid results could have the bit either set or clear). This does not affect slot lookups which occur under lock -- they can never return an invalid result. Is needed in future for lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6df8ba4f |
|
16-Oct-2007 |
Fengguang Wu <wfg@mail.ustc.edu.cn> |
radixtree: introduce radix_tree_next_hole() Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for the first hole. It will be used in interleaved readahead. The implementation is dumb and obviously correct. It can help debug(and document) the possible smart one in future. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
20c2df83 |
|
19-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
#
d7f0923d |
|
14-Jul-2007 |
David Chinner <dgc@sgi.com> |
[LIB]: export radix_tree_preload() XFS filestreams functionality uses radix trees and the preload functions. XFS can be built as a module and hence we need radix_tree_preload() exported. radix_tree_preload_end() is a static inline, so it doesn't need exporting. Signed-Off-By: Dave Chinner <dgc@sgi.com> Signed-Off-By: Tim Shimmin <tes@sgi.com>
|
#
8bb78442 |
|
09-May-2007 |
Rafael J. Wysocki <rjw@rjwysocki.net> |
Add suspend-related notifications for CPU hotplug Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02316067 |
|
06-Dec-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] hotplug CPU: clean up hotcpu_notifier() use There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn, prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus generating compiler warnings of unused symbols, hence forcing people to add #ifdefs. the compiler can skip truly unused functions just fine: text data bss dec hex filename 1624412 728710 3674856 6027978 5bfaca vmlinux.before 1624412 728710 3674856 6027978 5bfaca vmlinux.after [akpm@osdl.org: topology.c fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
7cf9c2c7 |
|
06-Dec-2006 |
Nick Piggin <npiggin@suse.de> |
[PATCH] radix-tree: RCU lockless readside Make radix tree lookups safe to be performed without locks. Readers are protected against nodes being deleted by using RCU based freeing. Readers are protected against new node insertion by using memory barriers to ensure the node itself will be properly written before it is visible in the radix tree. Each radix tree node keeps a record of their height (above leaf nodes). This height does not change after insertion -- when the radix tree is extended, higher nodes are only inserted in the top. So a lookup can take the pointer to what is *now* the root node, and traverse down it even if the tree is concurrently extended and this node becomes a subtree of a new root. "Direct" pointers (tree height of 0, where root->rnode points directly to the data item) are handled by using the low bit of the pointer to signal whether rnode is a direct pointer or a pointer to a radix tree node. When a reader wants to traverse the next branch, they will take a copy of the pointer. This pointer will be either NULL (and the branch is empty) or non-NULL (and will point to a valid node). [akpm@osdl.org: cleanups] [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications] [clameter@sgi.com: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e18b890b |
|
06-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
20241ad4 |
|
10-Oct-2006 |
Al Viro <viro@ftp.linux.org.uk> |
[PATCH] gfp annotations: radix_tree_root struct radix_tree_root has unused upper bits of ->gfp_mask reused for tags bitmap. Annotated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e5dcd90b |
|
25-Jun-2006 |
Wu Fengguang <wfg@mail.ustc.edu.cn> |
[PATCH] radixtree: normalize radix_tree_tag_get() return value In radix_tree_tag_get(), return normalized value of 0/1, as indicated by its comment. Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
4c91c364 |
|
23-Jun-2006 |
Peter Zijlstra <a.p.zijlstra@chello.nl> |
[PATCH] buglet in radix_tree_tag_set The comment states: 'Setting a tag on a not-present item is a BUG.' Hence if 'index' is larger than the maxindex; the item _cannot_ be presen; it should also be a BUG. Also, this allows the following statement (assume a fresh tree): radix_tree_tag_set(root, 16, 1); to fail silently, but when preceded by: radix_tree_insert(root, 32, item); it would BUG, because the height has been extended by the insert. In neither case was 16 present. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
cfd9b7df |
|
23-Jun-2006 |
Nick Piggin <npiggin@suse.de> |
[PATCH] radix-tree: small Reduce radix tree node memory usage by about a factor of 4 for small files (< 64K). There are pointer traversal and memory usage costs for large files with dense pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
612d6c19 |
|
23-Jun-2006 |
Nick Piggin <npiggin@suse.de> |
[PATCH] radix-tree: direct data The ability to have height 0 radix trees (a direct pointer to the data item rather than going through a full node->slot) quietly disappeared with old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd. On 64-bit machines this causes nearly 600 bytes to be used for every <= 4K file in pagecache. Re-introduce this feature, root tags stored in spare ->gfp_mask bits. Simplify radix_tree_delete's complex tag clearing arrangement (which would become even more complex) by just falling back to tag clearing functions (the pagecache radix-tree never uses this path anyway, so the icache savings will mean it's actually a speedup). On my 4GB G5, this saves 8MB RAM per kernel kernel source+object tree in pagecache. Pagecache lookup, insertion, and removal speed for small files will also be improved. This makes RCU radix tree harder, but it's worth it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
daff89f3 |
|
25-Mar-2006 |
Jonathan Corbet <corbet@lwn.net> |
[PATCH] radix-tree documentation cleanups Documentation changes to help radix tree users avoid overrunning the tags array. RADIX_TREE_TAGS moves to linux/radix-tree.h and is now known as RADIX_TREE_MAX_TAGS (Nick Piggin's idea). Tag parameters are changed to unsigned, and some comments are updated. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
90f9dd8f |
|
15-Feb-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] Fix over-zealous tag clearing in radix_tree_delete If a tag is set for a node being deleted from a radix_tree, then that tag gets cleared from the parent of the node, even if it is set for some siblings of the node begin deleted. This patch changes the logic to include a test for any_tag_set similar to the logic a little futher down. Care is taken to ensure that 'nr_cleared_tags' remains equals to the number of entries in the 'tags' array which are set to '0' (which means that this tag is not set in the tree below pathp->node, and should be cleared at pathp->node and possibly above. [ Nick says: "Linus FYI, I was able to modify the radix tree test harness to catch the bug and can no longer trigger it after the fix. Resulting code passes all other harness tests as well of course." ] Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
a5f51c96 |
|
08-Jan-2006 |
Nick Piggin <nickpiggin@yahoo.com.au> |
[PATCH] radix-tree: reduce tree height upon partial truncation Shrink the height of a radix tree when it is partially truncated - we only do shrinkage of full truncation at present. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d5274261 |
|
08-Jan-2006 |
Nick Piggin <nickpiggin@yahoo.com.au> |
[PATCH] radix tree: early termination of tag clearing Correctly determine the tags to be cleared in radix_tree_delete() so we don't keep moving up the tree clearing tags that we don't need to. For example, if a tag is simply not set in the deleted item, nor anywhere up the tree, radix_tree_delete() would attempt to clear it up the entire height of the tree. Also, tag_set() was made conditional so as not to dirty too many cachelines high up in the radix tree. Instead, put this logic into radix_tree_tag_set(). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
6e954b9e |
|
08-Jan-2006 |
Nick Piggin <nickpiggin@yahoo.com.au> |
[PATCH] radix tree: code consolidation Introduce helper any_tag_set() rather than repeat the same code sequence 4 times. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
a4331366 |
|
07-Nov-2005 |
Hans Reiser <reiser@namesys.com> |
[PATCH] reiser4: add radix_tree_lookup_slot() Reiser4 uses radix trees to solve a trouble reiser4_readdir has serving nfs requests. Unfortunately, radix tree api lacks an operation suitable for modifying existing entry. This patch adds radix_tree_lookup_slot which returns pointer to found item within the tree. That location can be then updated. Both Nick and Christoph Lameter have patches which need this as well. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
dd0fc66f |
|
07-Oct-2005 |
Al Viro <viro@ftp.linux.org.uk> |
[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
00b61f51 |
|
10-Sep-2005 |
Victor Fusco <victor@cetuc.puc-rio.br> |
[PATCH] lib/radix-tree: Fix "nocast type" warnings Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
32605a18 |
|
06-Sep-2005 |
Marcelo Tosatti <marcelo.tosatti@cyclades.com> |
[PATCH] radix_tag_get(): differentiate between no present node and tag unset cases Simple patch to radix_tree_tag_get() to return different values for non present node and tag unset. The function is not used by any in-kernel callers (yet), but this information is definitely useful. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
201b6264 |
|
06-Sep-2005 |
Christoph Lameter <clameter@engr.sgi.com> |
[PATCH] radix-tree: Remove unnecessary indirections and clean up code - There is frequent use of indirections in the radix code. This patch removes those indirections, makes the code more readable and allows the compilers to generate better code. - Removing indirections allows the removal of several casts. - Removing indirections allows the reduction of the radix_tree_path size from 3 to 2 words. - Use pathp-> consistently. - Remove unnecessary tmp variable in radix_tree_insert - Separate the upper layer processing from the lowest layer in __lookup() in order to make it easier to understand what is going on and allow compilers to generate better code for the loop. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: James Bottomley <James.Bottomley@steeleye.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
6c036527 |
|
07-Jul-2005 |
Christoph Lameter <christoph@lameter.com> |
[PATCH] mostly_read data section Add a new section called ".data.read_mostly" for data items that are read frequently and rarely written to like cpumaps etc. If these maps are placed in the .data section then these frequenly read items may end up in cachelines with data is is frequently updated. In that case all processors in an SMP system must needlessly reload the cachelines again and again containing elements of those frequently used variables. The ability to share these cachelines will allow each cpu in an SMP system to keep local copies of those shared cachelines thereby optimizing performance. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|