History log of /linux-master/kernel/cgroup/rstat.c
Revision Date Author Comments
# 6f3189f3 28-Jan-2024 Daniel Xu <dxu@dxuuu.xyz>

bpf: treewide: Annotate BPF kfuncs in BTF

This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:

1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
generate appropriate prototypes for functions with "bpf_kfunc" tag

To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# d499fd41 30-Nov-2023 Waiman Long <longman@redhat.com>

cgroup/rstat: Optimize cgroup_rstat_updated_list()

The current design of cgroup_rstat_cpu_pop_updated() is to traverse
the updated tree in a way to pop out the leaf nodes first before
their parents. This can cause traversal of multiple nodes before a
leaf node can be found and popped out. IOW, a given node in the tree
can be visited multiple times before the whole operation is done. So
it is not very efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed first before any flushing operation is being
done, we can optimize the way the updated tree nodes are being popped
by pushing the parents first to the tail end of the list before their
children. In this way, most updated tree nodes will be visited only
once with the exception of the subtree root as we still need to go
back to its parent and popped it out of its updated_children list.
This also makes the code easier to read.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# e76d28bd 03-Nov-2023 Waiman Long <longman@redhat.com>

cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happen when there is a page fault even though the task is running in
user space most of the time.

The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed when
cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
operation which can take a much longer time does not need that protection
as it is already protected by cgroup_rstat_lock.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front with the lock
released afterward before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.

Some instrumentation code are added to measure the cpu_lock hold time
right after lock acquisition to after releasing the lock. Parallel
kernel build on a 2-socket x86-64 server is used as the benchmarking
tool for measuring the lock hold time.

The maximum cpu_lock hold time before and after the patch are 100us and
29us respectively. So the worst case time is reduced to about 30% of
the original. However, there may be some OS or hardware noises like NMI
or SMI in the test system that can worsen the worst case value. Those
noises are usually tuned out in a real production environment to get
a better result.

OTOH, the lock hold time frequency distribution should give a better
idea of the performance benefit of the patch. Below were the frequency
distribution before and after the patch:

Hold time Before patch After patch
--------- ------------ -----------
0-01 us 804,139 13,738,708
01-05 us 9,772,767 1,177,194
05-10 us 4,595,028 4,984
10-15 us 303,481 3,562
15-20 us 78,971 1,314
20-25 us 24,583 18
25-30 us 6,908 12
30-40 us 8,015
40-50 us 2,192
50-60 us 316
60-70 us 43
70-80 us 7
80-90 us 2
>90 us 3

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 15fb6f2b 31-Oct-2023 Dave Marchevsky <davemarchevsky@fb.com>

bpf: Add __bpf_hook_{start,end} macros

Not all uses of __diag_ignore_all(...) in BPF-related code in order to
suppress warnings are wrapping kfunc definitions. Some "hook point"
definitions - small functions meant to be used as attach points for
fentry and similar BPF progs - need to suppress -Wmissing-declarations.

We could use __bpf_kfunc_{start,end}_defs added in the previous patch in
such cases, but this might be confusing to someone unfamiliar with BPF
internals. Instead, this patch adds __bpf_hook_{start,end} macros,
currently having the same effect as __bpf_kfunc_{start,end}_defs, then
uses them to suppress warnings for two hook points in the kernel itself
and some bpf_testmod hook points as well.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20231031215625.2343848-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 0437719c 06-Aug-2023 Hao Jia <jiahao.os@bytedance.com>

cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants

The member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.

Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.

So we add the subtree_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF/drgn, etc.
And we are still trying to determine how to expose it in the
cgroupfs interface.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 0a2dc6ac 21-Apr-2023 Yosry Ahmed <yosryahmed@google.com>

cgroup: remove cgroup_rstat_flush_atomic()

Previous patches removed the only caller of cgroup_rstat_flush_atomic().
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 8bff9a04 30-Mar-2023 Yosry Ahmed <yosryahmed@google.com>

cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"

Patch series "memcg: avoid flushing stats atomically where possible", v3.

rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.

Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.

Patch 3 makes sure we never try to flush from within an irq context.

Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.

Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.


This patch (of 8):

cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.

Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fcdb1eda 15-Mar-2023 Josh Don <joshdon@google.com>

cgroup: fix display of forceidle time at root

We need to reset forceidle_sum to 0 when reading from root, since the
bstat we accumulate into is stack allocated.

To make this more robust, just replace the existing cputime reset with a
memset of the overall bstat.

Signed-off-by: Josh Don <joshdon@google.com>
Fixes: 1fcf54deb767 ("sched/core: add forced idle accounting for cgroups")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: Tejun Heo <tj@kernel.org>


# 400031e0 01-Feb-2023 David Vernet <void@manifault.com>

bpf: Add __bpf_kfunc tag to all kfuncs

Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com


# c62256dd 30-Nov-2022 Jens Axboe <axboe@kernel.dk>

Revert "blk-cgroup: Flush stats at blkgs destruction path"

This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.

We've had a few reports on this causing a crash at boot time, because
of a reference issue. While this problem seemginly did exist before
the patch and needs solving separately, this patch makes it a lot
easier to trigger.

Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/69af7ccb-6901-c84c-0e95-5682ccfb750c@acm.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dae590a6 04-Nov-2022 Waiman Long <longman@redhat.com>

blk-cgroup: Flush stats at blkgs destruction path

As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.

To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a319185b 24-Aug-2022 Yosry Ahmed <yosryahmed@google.com>

cgroup: bpf: enable bpf programs to integrate with rstat

Enable bpf programs to make use of rstat to collect cgroup hierarchical
stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat
flushing, for bpf progs that flush stats to attach to. Attaching a bpf
prog to this hook effectively registers it as a flush callback.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 1fcf54de 29-Jun-2022 Josh Don <joshdon@google.com>

sched/core: add forced idle accounting for cgroups

4feee7d1260 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.

rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.

Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com


# b1e2c8df 23-Mar-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

cgroup: use irqsave in cgroup_rstat_flush_locked().

All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock
either with spin_lock_irq() or spin_lock_irqsave().
cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which
is a raw_spin_lock. This lock is also acquired in
cgroup_rstat_updated() in IRQ context and therefore requires _irqsave()
locking suffix in cgroup_rstat_flush_locked().

Since there is no difference between spin_lock_t and raw_spin_lock_t on
!RT lockdep does not complain here. On RT lockdep complains because the
interrupts were not disabled here and a deadlock is possible.

Acquire the raw_spin_lock_t with disabled interrupts.

Link: https://lkml.kernel.org/r/20220301122143.1521823-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: cgroup: add a comment to cgroup_rstat_flush_locked().

Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed.

Link: https://lkml.kernel.org/r/Yh+DOK73hfVV5ThX@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 95b99f35 07-Jan-2022 Wei Yang <richard.weiyang@gmail.com>

cgroup: rstat: retrieve current bstat to delta directly

Instead of retrieve current bstat to cur and copy it to delta, let's use
delta directly.

This saves one copy operation and has the same code convention as
propagating delta to parent.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 4148be7d 07-Jan-2022 Wei Yang <richard.weiyang@gmail.com>

cgroup: rstat: use same convention to assign cgroup_base_stat

In function cgroup_base_stat_flush(), we update cgroup_base_stat by
getting rstatc->bstat and adjust delta to related fields.

There are two convention to assign cgroup_base_stat in this function:

* rstat2 = rstat1
* rstat2.cputime = rstat1.cputime

The second convention may make audience think just field "cputime" is
updated, while cputime is the only field in cgroup_base_stat.

Let's use the same convention to eliminate this confusion.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# f5f60d23 24-Dec-2021 Wei Yang <richard.weiyang@gmail.com>

cgroup/rstat: check updated_next only for root

After commit dc26532aed0a ("cgroup: rstat: punt root-level optimization to
individual controllers"), each rstat on updated_children list has its
->updated_next not NULL.

This means we can remove the check on ->updated_next, if we make sure
the subtree from @root is on list, which could be done by checking
updated_next for root.

tj: Coding style fixes.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 0da41f73 24-Dec-2021 Wei Yang <richard.weiyang@gmail.com>

cgroup: rstat: explicitly put loop variant in while

Instead of do while unconditionally, let's put the loop variant in
while.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# eda09706 03-Nov-2021 Michal Koutný <mkoutny@suse.com>

cgroup: rstat: Mark benign data race to silence KCSAN

There is a race between updaters and flushers (flush can possibly miss
the latest update(s)). This is expected as explained in
cgroup_rstat_updated() comment, add also machine readable annotation so
that KCSAN results aren't noisy.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/r/CACkBjsbPVdkub=e-E-p1WBOLxS515ith-53SFdmFHWV_QMo40w@mail.gmail.com
Suggested-by: Hao Sun <sunhao.th@gmail.com>

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 81c49d39 28-Oct-2021 Dan Schatzberg <schatzberg.dan@gmail.com>

cgroup: Fix rootcg cpu.stat guest double counting

In account_guest_time in kernel/sched/cputime.c guest time is
attributed to both CPUTIME_NICE and CPUTIME_USER in addition to
CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding
both to calculate usage results in double counting any guest time at
the rootcg.

Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# c3df5fb5 27-Jul-2021 Tejun Heo <tj@kernel.org>

cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync

0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
context. However, rstat paths use u64_stats_sync to synchronize access to
64bit stat counters on 32bit machines. u64_stats_sync is implemented using
seq_lock and trying to read from an irq context can lead to A-A deadlock if
the irq happens to interrupt the stat update.

Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
u64_stats_update_end_irqrestore() - in the update paths. Note that none of
this matters on 64bit machines. All these are just for 32bit SMP setups.

Note that the interface was introduced way back, its first and currently
only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to
rstat"). Stable tagging targets this commit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+


# 2ca11b0e 25-May-2021 Yang Li <yang.lee@linux.alibaba.com>

cgroup: Fix kernel-doc

Fix function name in cgroup.c and rstat.c kernel-doc comment
to remove these warnings found by clang_w1.

kernel/cgroup/cgroup.c:2401: warning: expecting prototype for
cgroup_taskset_migrate(). Prototype was for cgroup_migrate_execute()
instead.
kernel/cgroup/rstat.c:233: warning: expecting prototype for
cgroup_rstat_flush_begin(). Prototype was for cgroup_rstat_flush_hold()
instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 'commit e595cd706982 ("cgroup: track migration context in cgroup_mgctx")'
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 08b2b6fd 24-May-2021 Zhen Lei <thunder.leizhen@huawei.com>

cgroup: fix spelling mistakes

Fix some spelling mistakes in comments:
hierarhcy ==> hierarchy
automtically ==> automatically
overriden ==> overridden
In absense of .. or ==> In absence of .. and
assocaited ==> associated
taget ==> target
initate ==> initiate
succeded ==> succeeded
curremt ==> current
udpated ==> updated

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# dc26532a 29-Apr-2021 Johannes Weiner <hannes@cmpxchg.org>

cgroup: rstat: punt root-level optimization to individual controllers

Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level. This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including at
the root level. In preparation for the memory controller switching to
rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

This is the case for the io controller and the cgroup base stats. In
their respective flush callbacks, check whether the parent is the root
cgroup, and if so, skip the unnecessary upward propagation.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root. The
queueing for a flush touches only per-cpu data, and only the first stat
change since a flush requires a (per-cpu) lock.

Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7df69b8 29-Apr-2021 Johannes Weiner <hannes@cmpxchg.org>

cgroup: rstat: support cgroup1

Rstat currently only supports the default hierarchy in cgroup2. In
order to replace memcg's private stats infrastructure - used in both
cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.

The initialization and destruction callbacks for regular cgroups are
already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.

The initialization of the root cgroup is currently hardcoded to only
handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
and cgroup_destroy_root() to handle the default root as well as the
various cgroup1 roots we may set up during mounting.

The linking of css to cgroups happens in code shared between cgroup1 and
cgroup2 as well. Simply remove the cgroup_on_dfl() guard.

Linkage of the root css to the root cgroup is a bit trickier: per
default, the root css of a subsystem controller belongs to the default
hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
cgroup1 version, the root css is stolen and moved to the cgroup1 root;
on unmount, the css moves back to the default hierarchy. Annotate
rebind_subsystems() to move the root css linkage along between roots.

Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7582f30c 27-Jun-2020 Christoph Hellwig <hch@lst.de>

cgroup: unexport cgroup_rstat_updated

cgroup_rstat_updated is only used by core block code, no need to
export it.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 936f2a70 27-May-2020 Boris Burkov <boris@bur.io>

cgroup: add cpu.stat file to root cgroup

Currently, the root cgroup does not have a cpu.stat file. Add one which
is consistent with /proc/stat to capture global cpu statistics that
might not fall under cgroup accounting.

We haven't done this in the past because the data are already presented
in /proc/stat and we didn't want to add overhead from collecting root
cgroup stats when cgroups are configured, but no cgroups have been
created.

By keeping the data consistent with /proc/stat, I think we avoid the
first problem, while improving the usability of cgroups stats.
We avoid the second problem by computing the contents of cpu.stat from
existing data collected for /proc/stat anyway.

Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# d8ef4b38 09-Apr-2020 Tejun Heo <tj@kernel.org>

Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"

This reverts commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window").

The commit was added in anticipation of memcg rstat conversion which needed
synchronous accounting for the event counters (e.g. oom kill count). However,
the conversion didn't get merged due to percpu memory overhead concern which
couldn't be addressed at the time.

Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated()
meant that every scheduling event now had to go through an additional full
barrier and Mel Gorman noticed it as 1% regression in netperf UDP_STREAM test.

There's no need to have this barrier in tree now and even if we need
synchronous accounting in the future, the right thing to do is separating that
out to a separate function so that hot paths which don't care about
synchronous behavior don't have to pay the overhead of the full barrier. Let's
revert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net
Cc: v4.18+


# 75ea91cd 10-Jan-2020 Chen Zhou <chenzhou10@huawei.com>

cgroup: fix function name in comment

Function name cgroup_rstat_cpu_pop_upated() in comment should be
cgroup_rstat_cpu_pop_updated().

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 1bb5ec2e 06-Nov-2019 Tejun Heo <tj@kernel.org>

cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency

cgroup->bstat_pending is used to determine the base stat delta to
propagate to the parent. While correct, this is different from how
percpu delta is determined for no good reason and the inconsistency
makes the code more difficult to understand.

This patch makes parent propagation delta calculation use the same
method as percpu to global propagation.

* cgroup_base_stat_accumulate() is renamed to cgroup_base_stat_add()
and cgroup_base_stat_sub() is added.

* percpu propagation calculation is updated to use the above helpers.

* cgroup->bstat_pending is replaced with cgroup->last_bstat and
updated to use the same calculation as percpu propagation.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b4ff1b44 15-Feb-2019 Tejun Heo <tj@kernel.org>

cgroup, rstat: Don't flush subtree root unless necessary

cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
on flush. While it was only visiting updated ones in the subtree, it
was visiting @root unconditionally. We can easily check whether @root
is updated or not by looking at its ->updated_next just as with the
cgroups in the subtree.

* Remove the unnecessary cgroup_parent() test. The system root cgroup
is never updated and thus its ->updated_next is always NULL. No
need to test whether cgroup_parent() exists in addition to
->updated_next.

* Terminate traverse if ->updated_next is NULL. This can only happen
for subtree @root and there's no reason to visit it if it's not
marked updated.

This reduces cpu consumption when reading a lot of rstat backed files.
In a micro benchmark reading stat from ~1600 cgroups, the sys time was
lowered by >40%.

Signed-off-by: Tejun Heo <tj@kernel.org>


# c43c5ea7 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Make cgroup_rstat_updated() ready for root cgroup usage

cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
the parent. If there's no parent, it never gets linked and the
function ends up grabbing and releasing the cgroup_rstat_lock each
time for no reason which can be expensive.

This hasn't been a problem till now because nobody was calling the
function for the root cgroup but rstat is gonna be exposed to
controllers and use cases, so let's get ready. Make
cgroup_rstat_updated() an no-op for the root cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 9a9e97b2 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Add memory barriers to plug cgroup_rstat_updated() race window

cgroup_rstat_updated() has a small race window where an updated
signaling can race with flush and could be lost till the next update.
This wasn't a problem for the existing usages, but we plan to use
rstat to track counters which need to be accurate.

This patch plugs the race window by synchronizing
cgroup_rstat_updated() and flush path with memory barriers around
cgroup_rstat_cpu->updated_next pointer.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 8f53470b 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Add cgroup_subsys->css_rstat_flush()

This patch adds cgroup_subsys->css_rstat_flush(). If a subsystem has
this callback, its csses are linked on cgrp->css_rstat_list and rstat
will call the function whenever the associated cgroup is flushed.
Flush is also performed when such csses are released so that residual
counts aren't lost.

Combined with the rstat API previous patches factored out, this allows
controllers to plug into rstat to manage their statistics in a
scalable way.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 0fa294fb 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Replace cgroup_rstat_mutex with a spinlock

Currently, rstat flush path is protected with a mutex which is fine as
all the existing users are from interface file show path. However,
rstat is being generalized for use by controllers and flushing from
atomic contexts will be necessary.

This patch replaces cgroup_rstat_mutex with a spinlock and adds a
irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit
yield handling is added to the flush path so that other flush
functions can yield to other threads and flushers.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 6162cef0 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Factor out and expose cgroup_rstat_*() interface functions

cgroup_rstat is being generalized so that controllers can use it too.
This patch factors out and exposes the following interface functions.

* cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
consistency.

* cgroup_rstat_flush_hold/release(): Factored out from base stat
implementation.

* cgroup_rstat_flush(): Verbatim expose.

While at it, drop assert on cgroup_rstat_mutex in
cgroup_base_stat_flush() as it crosses layers and make a minor comment
update.

v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

Signed-off-by: Tejun Heo <tj@kernel.org>


# a17556f8 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Reorganize kernel/cgroup/rstat.c

Currently, rstat.c has rstat and base stat implementations intermixed.
Collect base stat implementation at the end of the file. Also,
reorder the prototypes.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>


# d4ff749b 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Distinguish base resource stat implementation from rstat

Base resource stat accounts universial (not specific to any
controller) resource consumptions on top of rstat. Currently, its
implementation is intermixed with rstat implementation making the code
confusing to follow.

This patch clarifies the distintion by doing the followings.

* Encapsulate base resource stat counters, currently only cputime, in
struct cgroup_base_stat.

* Move prev_cputime into struct cgroup and initialize it with cgroup.

* Rename the related functions so that they start with cgroup_base_stat.

* Prefix the related variables and field names with b.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>


# c58632b3 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Rename stat to rstat

stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse. Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

This patch does the following renames. No other changes.

* cpu_stat -> rstat_cpu
* stat -> rstat
* ?cstat -> ?rstatc

Note that the renames are selective. The unrenamed are the ones which
implement basic resource statistics on top of rstat. This will be
further cleaned up in the following patches.

Signed-off-by: Tejun Heo <tj@kernel.org>


# a5c2b93f 26-Apr-2018 Tejun Heo <tj@kernel.org>

cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c

stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse. Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

First, rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c. No
content changes.

Signed-off-by: Tejun Heo <tj@kernel.org>