Cross Reference: /linux-master/lib/percpu-refcount.c

History log of /linux-master/lib/percpu-refcount.c
Revision	Date	Author	Comments
# 343a72e5	16-Oct-2022	Joel Fernandes (Google) <joel@joelfernandes.org>	percpu-refcount: Use call_rcu_hurry() for atomic switch Earlier commits in this series allow battery-powered systems to build their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option. This Kconfig option causes call_rcu() to delay its callbacks in order to batch callbacks. This means that a given RCU grace period covers more callbacks, thus reducing the number of grace periods, in turn reducing the amount of energy consumed, which increases battery lifetime which can be a very good thing. This is not a subtle effect: In some important use cases, the battery lifetime is increased by more than 10%. This CONFIG_RCU_LAZY=y option is available only for CPUs that offload callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y. Delaying callbacks is normally not a problem because most callbacks do nothing but free memory. If the system is short on memory, a shrinker will kick all currently queued lazy callbacks out of their laziness, thus freeing their memory in short order. Similarly, the rcu_barrier() function, which blocks until all currently queued callbacks are invoked, will also kick lazy callbacks, thus enabling rcu_barrier() to complete in a timely manner. However, there are some cases where laziness is not a good option. For example, synchronize_rcu() invokes call_rcu(), and blocks until the newly queued callback is invoked. It would not be a good for synchronize_rcu() to block for ten seconds, even on an idle system. Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a given CPU kicks any lazy callbacks that might be already queued on that CPU. After all, if there is going to be a grace period, all callbacks might as well get full benefit from it. Yes, this could be done the other way around by creating a call_rcu_lazy(), but earlier experience with this approach and feedback at the 2022 Linux Plumbers Conference shifted the approach to call_rcu() being lazy with call_rcu_hurry() for the few places where laziness is inappropriate. And another call_rcu() instance that cannot be lazy is the one on the percpu refcounter's "per-CPU to atomic switch" code path, which uses RCU when switching to atomic mode. The enqueued callback wakes up waiters waiting in the percpu_ref_switch_waitq. Allowing this callback to be lazy would result in unacceptable slowdowns for users of per-CPU refcounts, such as blk_pre_runtime_suspend(). Therefore, make __percpu_ref_switch_to_atomic() use call_rcu_hurry() in order to revert to the old behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: <linux-mm@kvack.org>
# a9171431	18-May-2022	Al Viro <viro@zeniv.linux.org.uk>	percpu_ref_init(): clean ->percpu_count_ref on failure That way percpu_ref_exit() is safe after failing percpu_ref_init(). At least one user (cgroup_create()) had a double-free that way; there might be other similar bugs. Easier to fix in percpu_ref_init(), rather than playing whack-a-mole in sloppy users... Usual symptoms look like a messed refcounting in one of subsystems that use percpu allocations (might be percpu-refcount, might be something else). Having refcounts for two different objects share memory is Not Nice(tm)... Reported-by: syzbot+5b1e53987f858500ec00@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 9e9da02a	11-May-2021	Nikolay Borisov <nborisov@suse.com>	percpu_ref: Don't opencode percpu_ref_is_dying Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
# 3375efed	08-Dec-2020	Paul E. McKenney <paulmck@kernel.org>	percpu_ref: Dump mem_dump_obj() info upon reference-count underflow Reference-count underflow for percpu_ref is detected in the RCU callback percpu_ref_switch_to_atomic_rcu(), and the resulting warning does not print anything allowing easy identification of which percpu_ref use case is underflowing. This is of course not normally a problem when developing a new percpu_ref use case because it is most likely that the problem resides in this new use case. However, when deploying a new kernel to a large set of servers, the underflow might well be a new corner case in any of the old percpu_ref use cases. This commit therefore calls mem_dump_obj() to dump out any additional available information on the underflowing percpu_ref instance. Cc: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
# 7ea6bf2e	08-Oct-2020	Ming Lei <ming.lei@redhat.com>	percpu_ref: don't refer to ref->data if it isn't allocated We can't check ref->data->confirm_switch directly in __percpu_ref_exit(), since ref->data may not be allocated in one not-initialized refcount. Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Reported-by: syzbot+fd15ff734dace9e16437@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2b0d3d3e	01-Oct-2020	Ming Lei <ming.lei@redhat.com>	percpu_ref: reduce memory footprint of percpu_ref in fast path 'struct percpu_ref' is often embedded into one user structure, and the instance is usually referenced in fast path, however actually only 'percpu_count_ptr' is needed in fast path. So move other fields into one new structure of 'percpu_ref_data', and allocate it dynamically via kzalloc(), then memory footprint of 'percpu_ref' in fast path is reduced a lot and becomes suitable to put into hot cacheline of user structure. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Veronika Kabatova <vkabatov@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a818e526	04-Jun-2020	Joe Perches <joe@perches.com>	lib/percpu-refcount.c: use a more common logging style Remove the trailing newline from the used-once pr_fmt and add it to the single use of pr_<level> in this code to use a more common logging style. Miscellanea: o Use %lu in the pr_debug format and remove the unnecessary cast Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: http://lkml.kernel.org/r/47372467902a047c03b0fd29aab56e0c38d3f848.camel@perches.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 15617dff	21-Feb-2020	Ira Weiny <ira.weiny@intel.com>	percpu_ref: Fix comment regarding percpu_ref_init flags The comment for percpu_ref_init() implies that using PERCPU_REF_ALLOW_REINIT will cause the refcount to start at 0. But this is not true. PERCPU_REF_ALLOW_REINIT starts the count at 1 as if the flags were zero. Add this fact to the kernel doc comment. Signed-off-by: Ira Weiny <ira.weiny@intel.com> [Dennis: reworded] Signed-off-by: Dennis Zhou <dennis@kernel.org>
# 457c8996	19-May-2019	Thomas Gleixner <tglx@linutronix.de>	treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
# 7d9ab9b6	07-May-2019	Roman Gushchin <guro@fb.com>	percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT Release percpu memory after finishing the switch to the atomic mode if only PERCPU_REF_ALLOW_REINIT isn't set. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
# d75f773c	25-Mar-2019	Sakari Ailus <sakari.ailus@linux.intel.com>	treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively %pF and %pf are functionally equivalent to %pS and %ps conversion specifiers. The former are deprecated, therefore switch the current users to use the preferred variant. The changes have been produced by the following command: git grep -l '%p[fF]' \| grep -v '^$tools\\|Documentation$/' \| \ while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done And verifying the result. Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: xen-devel@lists.xenproject.org Cc: linux-acpi@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: drbd-dev@lists.linbit.com Cc: linux-block@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvdimm@lists.01.org Cc: linux-pci@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-btrfs@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-mm@kvack.org Cc: ceph-devel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: David Sterba <dsterba@suse.com> (for btrfs) Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c) Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci) Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
# 36bd1a8e	06-Nov-2018	Paul E. McKenney <paulmck@kernel.org>	percpu-refcount: Replace call_rcu_sched() with call_rcu() Now that call_rcu()'s callback is not invoked until after all preempt-disable regions of code have completed (in addition to explicitly marked RCU read-side critical sections), call_rcu() can be used in place of call_rcu_sched(). This commit therefore makes that change. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Acked-by: Tejun Heo <tj@kernel.org>
# 18c9a6bb	26-Sep-2018	Bart Van Assche <bvanassche@acm.org>	percpu-refcount: Introduce percpu_ref_resurrect() This function will be used in a later patch to switch the struct request_queue q_usage_counter from killed back to live. In contrast to percpu_ref_reinit(), this new function does not require that the refcount is zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b3a5d111	14-Mar-2018	Tejun Heo <tj@kernel.org>	percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods percpu_ref internally uses sched-RCU to implement the percpu -> atomic mode switching and the documentation suggested that this could be depended upon. This doesn't seem like a good idea. * percpu_ref uses sched-RCU which has different grace periods regular RCU. Users may combine percpu_ref with regular RCU usage and incorrectly believe that regular RCU grace periods are performed by percpu_ref. This can lead to, for example, use-after-free due to premature freeing. * percpu_ref has a grace period when switching from percpu to atomic mode. It doesn't have one between the last put and release. This distinction is subtle and can lead to surprising bugs. * percpu_ref allows starting in and switching to atomic mode manually for debugging and other purposes. This means that there may not be any grace periods from kill to release. This patch makes it clear that the grace periods are percpu_ref's internal implementation detail and can't be depended upon by the users. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
# b393e8b3	09-Oct-2017	Paul E. McKenney <paulmck@kernel.org>	percpu: READ_ONCE() now implies smp_read_barrier_depends() Because READ_ONCE() now implies smp_read_barrier_depends(), this commit removes the now-redundant smp_read_barrier_depends() following the READ_ONCE() in __ref_is_percpu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com>
# 210f7cdc	14-Mar-2017	NeilBrown <neilb@suse.com>	percpu-refcount: support synchronous switch to atomic mode. percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then waits for it to complete. Also export percpu_ref_switch_to_* so they can be used from modules. This will be used in md/raid to count the number of pending write requests to an array. We occasionally need to check if the count is zero, but most often we don't care. We always want updates to the counter to be fast, as in some cases we count every 4K page. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>
# a67823c1	11-Aug-2016	Roman Pen <roman.penyaev@profitbricks.com>	percpu-refcount: init ->confirm_switch member properly This patch targets two things which are related to ->confirm_switch: 1. Init ->confirm_switch pointer with NULL on percpu_ref_init() or kernel frightfully complains with WARN_ON_ONCE(ref->confirm_switch) at __percpu_ref_switch_to_atomic if memory chunk was not properly zeroed. 2. Warn if RCU callback is still in progress on percpu_ref_exit(). The race still exists, because percpu_ref_call_confirm_rcu() drops ->confirm_switch to NULL early, but that is only a warning and still the caller is responsible that ref is no longer in active use. Hopefully that can help to catch incorrect usage of percpu-refcount. Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
# 33e465ce	29-Sep-2015	Tejun Heo <tj@kernel.org>	percpu_ref: allow operation mode switching operations to be called concurrently percpu_ref initially didn't have explicit mode switching operations. It started out in percpu mode and switched to atomic mode on kill and then released. Ensuring that kill operation is initiated only after init completes was naturally the caller's responsibility. percpu_ref_reinit() was introduced later but it didn't shift the synchronization responsibility. Reinit can't be performed until kill is confirmed, so there was nothing to worry about synchronization-wise. Also, as both reinit and kill manipulate the base reference, invocations of the same function couldn't be allowed to race each other. The latest additions of percpu_ref_switch_to_atomic/percpu() changed the situation. These two functions can be called any time as long as the percpu_ref is between init and exit and thus there are valid valid usage scenarios where these new functions race with each other or against reinit/kill. Mostly from inertia, f47ad4578461 ("percpu_ref: decouple switching to percpu mode and reinit") still left synchronization among percpu mode switching operations to its users. That the new switch functions can be freely mixed with kill/reinit but the operations themselves should be synchronized is too subtle a requirement and led to a very subtle race condition in blk-mq freezing path. This patch fixes the situation by introducing percpu_ref_switch_lock to protect mode switching operations. This ensures that percpu-ref users don't have to worry about mode changing operations racing against each other, e.g. switch_to_percpu against kill, as long as the sequence of operations is valid. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Link: http://lkml.kernel.org/g/1443287365-4244-7-git-send-email-akinobu.mita@gmail.com Fixes: f47ad4578461 ("percpu_ref: decouple switching to percpu mode and reinit")
# 3f49bdd9	29-Sep-2015	Tejun Heo <tj@kernel.org>	percpu_ref: restructure operation mode switching Restructure atomic/percpu mode switching. * The users of __percpu_ref_switch_to_atomic/percpu() now call a new function __percpu_ref_switch_mode() which calls either of the original switching functions depending on the current state of ref->force_atomic and the __PERCPU_REF_DEAD flag. The callers no longer check whether switching is necessary but always invoke __percpu_ref_switch_mode(). * !ref->confirm_switch waiting is collected into __percpu_ref_switch_mode(). This patch doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org>
# 18808354	29-Sep-2015	Tejun Heo <tj@kernel.org>	percpu_ref: unify staggered atomic switching wait behavior When an atomic or percpu switching starts before the previous atomic switching finishes, the taken behaviors are * If the new atomic switching has confirmation callback, it waits for the previous atomic switching to complete. * If the new percpu switching is the first percpu switching following the previous atomic switching, it waits the previous atomic switching to complete. No percpu_ref user depends on these subtleties. The only meaningful part is that, if the caller ensures that atomic switching isn't in progress, mode switching operations can be issued from any context. This patch pulls the wait logic to the top of both switching functions so that they always wait for the previous atomic switching to complete. This makes the behavior simpler and consistent for both directions and will help allowing concurrent invocations of mode switching functions. Signed-off-by: Tejun Heo <tj@kernel.org>
# b2302c7f	29-Sep-2015	Tejun Heo <tj@kernel.org>	percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic() Reorganize __percpu_ref_switch_to_atomic() so that it looks structurally similar to __percpu_ref_switch_to_percpu() and relocate percpu_ref_switch_to_atomic so that the two internal functions are co-located. This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org>
# a2f5630c	29-Sep-2015	Tejun Heo <tj@kernel.org>	percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation At the beginning, percpu_ref guaranteed a RCU grace period between a call to percpu_ref_kill_and_confirm() and the invocation of the confirmation callback. This guarantee exposed internal implementation details and got rescinded while switching over to sched RCU; however, __percpu_ref_switch_to_atomic() still inserts a full sched RCU grace period even when it can simply wait for the previous attempt. Remove the unnecessary grace period and perform the confirmation synchronously for staggered atomic switching attempts. Update comments accordingly. Signed-off-by: Tejun Heo <tj@kernel.org>
# bdb428c8	27-Dec-2015	Bogdan Sikora <bsikora@redhat.com>	lib+mm: fix few spelling mistakes All are in comments. Signed-off-by: Bogdan Sikora <bsikora@redhat.com> Cc: <linux-mm@kvack.org> Cc: Rafael Aquini <aquini@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jan Kara <jack@suse.cz> [jkosina@suse.cz: more fixup] Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
# 1cae13e7	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky Currently, a percpu_ref which is initialized with PERPCU_REF_INIT_ATOMIC or switched to atomic mode via switch_to_atomic() automatically reverts to percpu mode on the first percpu_ref_reinit(). This makes the atomic mode difficult to use for cases where a percpu_ref is used as a persistent on/off switch which may be cycled multiple times. This patch makes such atomic state sticky so that it survives through kill/reinit cycles. After this patch, atomic state is cleared only by an explicit percpu_ref_switch_to_percpu() call. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
# 2aad2a86	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: add PERCPU_REF_INIT_* flags With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
# f47ad457	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: decouple switching to percpu mode and reinit percpu_ref has treated the dropping of the base reference and switching to atomic mode as an integral operation; however, there's nothing inherent tying the two together. The use cases for percpu_ref have been expanding continuously. While the current init/kill/reinit/exit model can cover a lot, the coupling of kill/reinit with atomic/percpu mode switching is turning out to be too restrictive for use cases where many percpu_refs are created and destroyed back-to-back with only some of them reaching extended operation. The coupling also makes implementing always-atomic debug mode difficult. This patch separates out percpu mode switching into percpu_ref_switch_to_percpu() and reimplements percpu_ref_reinit() on top of it. * DEAD still requires ATOMIC. A dead ref can't be switched to percpu mode w/o going through reinit. v2: __percpu_ref_switch_to_percpu() was missing static. Fixed. Reported by Fengguang aka kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kbuild test robot <fengguang.wu@intel.com>
# 490c79a6	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: decouple switching to atomic mode and killing percpu_ref has treated the dropping of the base reference and switching to atomic mode as an integral operation; however, there's nothing inherent tying the two together. The use cases for percpu_ref have been expanding continuously. While the current init/kill/reinit/exit model can cover a lot, the coupling of kill/reinit with atomic/percpu mode switching is turning out to be too restrictive for use cases where many percpu_refs are created and destroyed back-to-back with only some of them reaching extended operation. The coupling also makes implementing always-atomic debug mode difficult. This patch separates out atomic mode switching into percpu_ref_switch_to_atomic() and reimplements percpu_ref_kill_and_confirm() on top of it. * The handling of __PERCPU_REF_ATOMIC and __PERCPU_REF_DEAD is now differentiated. Among get/put operations, percpu_ref_tryget_live() is the only one which cares about DEAD. * percpu_ref_switch_to_atomic() can be called multiple times on the same ref. This means that multiple @confirm_switch may get queued up which we can't do reliably without extra memory area. This is handled by making the later invocation synchronously wait for the completion of the previous one. This isn't particularly desirable but such synchronous waits shouldn't happen in most cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
# 27344a90	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: add PCPU_REF_DEAD percpu_ref will be restructured so that percpu/atomic mode switching and reference killing are dedoupled. In preparation, add PCPU_REF_DEAD and PCPU_REF_ATOMIC_DEAD which is OR of ATOMIC and DEAD. For now, ATOMIC and DEAD are changed together and all PCPU_REF_ATOMIC uses are converted to PCPU_REF_ATOMIC_DEAD without causing any behavior changes. percpu_ref_init() now specifies an explicit alignment when allocating the percpu counters so that the pointer has enough unused low bits to accomodate the flags. Note that one flag was fine as min alignment for percpu memory is 2 bytes but two flags are already too many for the natural alignment of unsigned longs on archs like cris and m68k. v2: The original patch had BUILD_BUG_ON() which triggers if unsigned long's alignment isn't enough to accomodate the flags, which triggered on cris and m64k. percpu_ref_init() updated to specify the required alignment explicitly. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: kbuild test robot <fengguang.wu@intel.com>
# 9e804d1f	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref will be restructured so that percpu/atomic mode switching and reference killing are dedoupled. In preparation, do the following renames. * percpu_ref->confirm_kill -> percpu_ref->confirm_switch * __PERCPU_REF_DEAD -> __PERCPU_REF_ATOMIC * __percpu_ref_alive() -> __ref_is_percpu() This patch is pure rename and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
# eecc16ba	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref uses pcpu_ prefix for internal stuff and percpu_ for externally visible ones. This is the same convention used in the percpu allocator implementation. It works fine there but percpu_ref doesn't have too much internal-only stuff and scattered usages of pcpu_ prefix are confusing than helpful. This patch replaces all pcpu_ prefixes with percpu_. This is pure rename and there's no functional change. Note that PCPU_REF_DEAD is renamed to __PERCPU_REF_DEAD to signify that the flag is internal. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
# 6251f997	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: minor code and comment updates * Some comments became stale. Updated. * percpu_ref_tryget() unnecessarily initializes @ret. Removed. * A blank line removed from percpu_ref_kill_rcu(). * Explicit function name in a WARN format string replaced with __func__. * WARN_ON() in percpu_ref_reinit() converted to WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
# a2237370	24-Sep-2014	Tejun Heo <tj@kernel.org>	percpu_ref: relocate percpu_ref_reinit() percpu_ref is gonna go through restructuring. Move percpu_ref_reinit() after percpu_ref_kill_and_confirm(). This will make later changes easier to follow and result in cleaner organization. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
# 9eca8046	24-Sep-2014	Tejun Heo <tj@kernel.org>	Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which was a temporary fix for SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
# 0a30288d	23-Sep-2014	Tejun Heo <tj@kernel.org>	blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used. This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which is difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count") Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# e625305b	19-Sep-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: make percpu_ref based on longs instead of ints percpu_ref is currently based on ints and the number of refs it can cover is (1 << 31). This makes it impossible to use a percpu_ref to count memory objects or pages on 64bit machines as it may overflow. This forces those users to somehow aggregate the references before contributing to the percpu_ref which is often cumbersome and sometimes challenging to get the same level of performance as using the percpu_ref directly. While using ints for the percpu counters makes them pack tighter on 64bit machines, the possible gain from using ints instead of longs is extremely small compared to the overall gain from per-cpu operation. This patch makes percpu_ref based on longs so that it can be used to directly count memory objects or pages. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
# 4843c332	19-Sep-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: improve WARN messages percpu_ref's WARN messages can be a lot more helpful by indicating who's the culprit. Make them report the release function that the offending percpu-refcount is associated with. This should make it a lot easier to track down the reported invalid refcnting operations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com>
# a34375ef	07-Sep-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: add @gfp to percpu_ref_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>
# 2d722782	28-Jun-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero() Now that explicit invocation of percpu_ref_exit() is necessary to free the percpu counter, we can implement percpu_ref_reinit() which reinitializes a released percpu_ref. This can be used implement scalable gating switch which can be drained and then re-opened without worrying about memory allocation failures. percpu_ref_is_zero() is added to be used in a sanity check in percpu_ref_exit(). As this function will be useful for other purposes too, make it a public interface. v2: Use smp_read_barrier_depends() instead of smp_load_acquire(). We only need data dep barrier and smp_load_acquire() is stronger and heavier on some archs. Spotted by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
# 9a1049da	28-Jun-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: require percpu_ref to be exited explicitly Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles. * It's impossible to re-init a released reference counter without going through re-allocation. * In the similar vein, it's impossible to initialize a percpu_ref count with static percpu variables. * We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init(). This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() is considered but it gets confusing with percpu_ref_kill() while "exit" clearly indicates that it's the counterpart of percpu_ref_init(). All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Li Zefan <lizefan@huawei.com>
# 7d742075	28-Jun-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: use unsigned long for pcpu_count pointer percpu_ref->pcpu_count is a percpu pointer with a status flag in its lowest bit. As such, it always goes through arithmetic operations which is very cumbersome to do on a pointer. It has to be first casted to unsigned long and then back. Let's just make the field unsigned long so that we can skip the first casts. While at it, rename it to pcpu_counter_ptr to clarify that it's a pointer value. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org>
# eae7975d	28-Jun-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: add helpers for ->percpu_count accesses * All four percpu_ref_() operations implemented in the header file perform the same operation to determine whether the percpu_ref is alive and extract the percpu pointer. Factor out the common logic into __pcpu_ref_alive(). This doesn't change the generated code. There are a couple places in percpu-refcount.c which masks out PCPU_REF_DEAD to obtain the percpu pointer. Factor it out into pcpu_count_ptr(). * The above changes make the WARN_ON_ONCE() conditional at the top of percpu_ref_kill_and_confirm() the only user of REF_STATUS(). Test PCPU_REF_DEAD directly and remove REF_STATUS(). This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org>
# d630dc4c	28-Jun-2014	Tejun Heo <tj@kernel.org>	percpu-refcount: one bit is enough for REF_STATUS percpu-refcount currently reserves two lowest bits of its percpu pointer to indicate its state; however, only one bit is used for PCPU_REF_DEAD. Simplify it by removing PCPU_STATUS_BITS/MASK and testing PCPU_REF_DEAD directly. This also allows the compiler to choose a more efficient instruction depending on the architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org>
# 687b0ad2	06-Jan-2014	Kent Overstreet <kmo@daterainc.com>	percpu-refcount: Add a WARN() for ref going negative AIO had a missing get, which led to an ioctx leak - after percpu_ref_kill() the ref was 0 so percpu_ref_put() never saw it hit 0. This wasn't noticed at the time because it all happened completely silently, this adds a WARN() which would've caught the aio bug. tj: Used WARN_ONCE() instead of WARN(). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# c9e8d128	07-Nov-2013	Nicholas Bellinger <nab@linux-iscsi.org>	percpu-refcount: Add EXPORT_SYMBOL to use percpu_ref from modules This patch adds EXPORT_SYMBOL() for percpu_ref_init(), percpu_ref_cancel_init() and percpu_ref_kill_and_confirm() so that percpu refcounting can be used by external modules. Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
# 5e9dd373	16-Oct-2013	Matias Bjorling <m@bjorling.me>	percpu_refcount: export symbols Export the interface to be used within modules. Signed-off-by: Matias Bjorling <m@bjorling.me> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# a4244454	16-Jun-2013	Tejun Heo <tj@kernel.org>	percpu-refcount: use RCU-sched insted of normal RCU percpu-refcount was incorrectly using preempt_disable/enable() for RCU critical sections against call_rcu(). 6a24474da8 ("percpu-refcount: consistently use plain (non-sched) RCU") fixed it by converting the preepmtion operations with rcu_read_[un]lock() citing that there isn't any advantage in using sched-RCU over using the usual one; however, rcu_read_[un]lock() for the preemptible RCU implementation - CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly more expensive than preempt_disable/enable(). In a contrived microbench which repeats the followings, - percpu_ref_get() - copy 32 bytes of data into percpu buffer - percpu_put_get() - copy 32 bytes of data into percpu buffer rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by about 15% when compared to using sched-RCU. As the RCU critical sections are extremely short, using sched-RCU shouldn't have any latency implications. Convert to RCU-sched. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Kent Overstreet <koverstreet@google.com> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rusty Russell <rusty@rustcorp.com.au>
# dbece3a0	13-Jun-2013	Tejun Heo <tj@kernel.org>	percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() Implement percpu_tryget() which stops giving out references once the percpu_ref is visible as killed. Because the refcnt is per-cpu, different CPUs will start to see a refcnt as killed at different points in time and tryget() may continue to succeed on subset of cpus for a while after percpu_ref_kill() returns. For use cases where it's necessary to know when all CPUs start to see the refcnt as dead, percpu_ref_kill_and_confirm() is added. The new function takes an extra argument @confirm_kill which is invoked when the refcnt is guaranteed to be viewed as killed on all CPUs. While this isn't the prettiest interface, it doesn't force synchronous wait and is much safer than requiring the caller to do its own call_rcu(). v2: Patch description rephrased to emphasize that tryget() may continue to succeed on some CPUs after kill() returns as suggested by Kent. v3: Function comment in percpu_ref_kill_and_confirm() updated warning people to not depend on the implied RCU grace period from the confirm callback as it's an implementation detail. Signed-off-by: Tejun Heo <tj@kernel.org> Slightly-Grumpily-Acked-by: Kent Overstreet <koverstreet@google.com>
# bc497bd3	12-Jun-2013	Tejun Heo <tj@kernel.org>	percpu-refcount: implement percpu_ref_cancel_init() Normally, percpu_ref_init() initializes and percpu_ref_kill() initiates destruction which completes asynchronously. The asynchronous destruction can be problematic in init failure path where the caller wants to destroy half-constructed object - distinguishing half-constructed objects from the usual release method can be painful for complex objects. This patch implements percpu_ref_cancel_init() which synchronously destroys the percpu_ref without invoking release. To avoid unintentional misuses, the function requires the ref to have finished percpu_ref_init() but never used and triggers WARN otherwise. v2: Explain the weird name and usage restriction in the function comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Kent Overstreet <koverstreet@google.com>
# acac7883	12-Jun-2013	Tejun Heo <tj@kernel.org>	percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu() Two small changes. * Unlike most init functions, percpu_ref_init() allocates memory and may fail. Let's mark it with __must_check in case the caller forgets. * percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to dereference @ref->pcpu_count, which can be misleading. The pointer is guaranteed to be valid and visible and can't change underneath the function. Drop ACCESS_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org>
# ac899061	12-Jun-2013	Tejun Heo <tj@kernel.org>	percpu-refcount: cosmetic updates * s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t postfix for types and the type is gonna be used for a different type of callback too. * Add @ARG to function comments. * Drop unnecessary and unaligned indentation from percpu_ref_init() function comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Kent Overstreet <koverstreet@google.com>
# c1ae6e9b	03-Jun-2013	Kent Overstreet <koverstreet@google.com>	percpu-refcount: Don't use silly cmpxchg() The cmpxchg() was just to ensure the debug check didn't race, which was a bit excessive. The caller is supposed to do the appropriate synchronization, which means percpu_ref_kill() can just do a simple store. Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 215e262f	31-May-2013	Kent Overstreet <koverstreet@google.com>	percpu: implement generic percpu refcounting This implements a refcount with similar semantics to atomic_get()/atomic_dec_and_test() - but percpu. It also implements two stage shutdown, as we need it to tear down the percpu counts. Before dropping the initial refcount, you must call percpu_ref_kill(); this puts the refcount in "shutting down mode" and switches back to a single atomic refcount with the appropriate barriers (synchronize_rcu()). It's also legal to call percpu_ref_kill() multiple times - it only returns true once, so callers don't have to reimplement shutdown synchronization. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style tweak] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Tejun Heo <tj@kernel.org>