History log of /linux-master/drivers/cpufreq/cpufreq_governor.h
Revision Date Author Comments
# a85ee640 23-Jan-2022 Kevin Hao <haokexin@gmail.com>

cpufreq: governor: Use kobject release() method to free dbs_data

The struct dbs_data embeds a struct gov_attr_set and
the struct gov_attr_set embeds a kobject. Since every kobject must have
a release() method and we can't use kfree() to free it directly,
so introduce cpufreq_dbs_data_release() to release the dbs_data via
the kobject::release() method. This fixes the calltrace like below:

ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x34
WARNING: CPU: 12 PID: 810 at lib/debugobjects.c:505 debug_print_object+0xb8/0x100
Modules linked in:
CPU: 12 PID: 810 Comm: sh Not tainted 5.16.0-next-20220120-yocto-standard+ #536
Hardware name: Marvell OcteonTX CN96XX board (DT)
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : debug_print_object+0xb8/0x100
lr : debug_print_object+0xb8/0x100
sp : ffff80001dfcf9a0
x29: ffff80001dfcf9a0 x28: 0000000000000001 x27: ffff0001464f0000
x26: 0000000000000000 x25: ffff8000090e3f00 x24: ffff80000af60210
x23: ffff8000094dfb78 x22: ffff8000090e3f00 x21: ffff0001080b7118
x20: ffff80000aeb2430 x19: ffff800009e8f5e0 x18: 0000000000000000
x17: 0000000000000002 x16: 00004d62e58be040 x15: 013590470523aff8
x14: ffff8000090e1828 x13: 0000000001359047 x12: 00000000f5257d14
x11: 0000000000040591 x10: 0000000066c1ffea x9 : ffff8000080d15e0
x8 : ffff80000a1765a8 x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff800009e8c000 x4 : ffff800009e8c760 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0001474ed040
Call trace:
debug_print_object+0xb8/0x100
__debug_check_no_obj_freed+0x1d0/0x25c
debug_check_no_obj_freed+0x24/0xa0
kfree+0x11c/0x440
cpufreq_dbs_governor_exit+0xa8/0xac
cpufreq_exit_governor+0x44/0x90
cpufreq_set_policy+0x29c/0x570
store_scaling_governor+0x110/0x154
store+0xb0/0xe0
sysfs_kf_write+0x58/0x84
kernfs_fop_write_iter+0x12c/0x1c0
new_sync_write+0xf0/0x18c
vfs_write+0x1cc/0x220
ksys_write+0x74/0x100
__arm64_sys_write+0x28/0x3c
invoke_syscall.constprop.0+0x58/0xf0
do_el0_svc+0x70/0x170
el0_svc+0x54/0x190
el0t_64_sync_handler+0xa4/0x130
el0t_64_sync+0x1a0/0x1a4
irq event stamp: 189006
hardirqs last enabled at (189005): [<ffff8000080849d0>] finish_task_switch.isra.0+0xe0/0x2c0
hardirqs last disabled at (189006): [<ffff8000090667a4>] el1_dbg+0x24/0xa0
softirqs last enabled at (188966): [<ffff8000080106d0>] __do_softirq+0x4b0/0x6a0
softirqs last disabled at (188957): [<ffff80000804a618>] __irq_exit_rcu+0x108/0x1a4

[ rjw: Because can be freed by the gov_attr_set_put() in
cpufreq_dbs_governor_exit() now, it is also necessary to put the
invocation of the governor ->exit() callback into the new
cpufreq_dbs_data_release() function. ]

Fixes: c4435630361d ("cpufreq: governor: New sysfs show/store callbacks for governor tunables")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 85750bcd 10-Mar-2022 Lianjie Zhang <zhanglianjie@uniontech.com>

cpufreq: unify show() and store() naming and use __ATTR_XX

Usually, sysfs attributes have .show and .store and their naming
convention is filename_show() and filename_store().

But in cpufreq the naming convention of these functions is
show_filename() and store_filename() which prevents __ATTR_RW() and
__ATTR_RO() from being used in there to simplify code.

Accordingly, change the naming convention of the sysfs .show and
.store methods in cpufreq to follow the one expected by __ATTR_RW()
and __ATTR_RO() and use these macros in that code.

Signed-off-by: Lianjie Zhang <zhanglianjie@uniontech.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9a2a9ebc 10-Nov-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Introduce governor flags

A new cpufreq governor flag will be added subsequently, so replace
the bool dynamic_switching fleid in struct cpufreq_governor with a
flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for
the "dynamic switching" governors instead of it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ed4676e2 19-Jul-2017 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Replace "max_transition_latency" with "dynamic_switching"

There is no limitation in the ondemand or conservative governors which
disallow the transition_latency to be greater than 10 ms.

The max_transition_latency field is rather used to disallow automatic
dynamic frequency switching for platforms which didn't wanted these
governors to run.

Replace max_transition_latency with a boolean (dynamic_switching) and
check for transition_latency == CPUFREQ_ETERNAL along with that. This
makes it pretty straight forward to read/understand now.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2d045036 19-Jul-2017 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Drop min_sampling_rate

The cpufreq core and governors aren't supposed to set a limit on how
fast we want to try changing the frequency. This is currently done for
the legacy governors with help of min_sampling_rate.

At worst, we may end up setting the sampling rate to a value lower than
the rate at which frequency can be changed and then one of the CPUs in
the policy will be only changing frequency for ever.

But that is something for the user to decide and there is no need to
have special handling for such cases in the core. Leave it for the user
to figure out.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 55687da1 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h>

We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 00bfe058 16-Nov-2016 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: conservative: Decrease frequency faster for deferred updates

Conservative governor changes the CPU frequency in steps.
That means that if a CPU runs at max frequency, it will need several
sampling periods to return to min frequency when the workload
is finished.

If the update function that calculates the load and target frequency
is deferred, the governor might need even more time to decrease the
frequency.

This may have impact to power consumption and after all conservative
should decrease the frequency if there is no workload at every sampling
rate.

To resolve the above issue calculate the number of sampling periods
that the update is deferred. Considering that for each sampling period
conservative should drop the frequency by a freq_step because the
CPU was idle apply the proper subtraction to requested frequency.

Below, the kernel trace with and without this patch. First an
intensive workload is applied on a specific CPU. Then the workload
is removed and the CPU goes to idle.

WITHOUT

<idle>-0 [007] dN.. 620.329153: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 620.350857: cpu_frequency: state=1700000 cpu_id=7
kworker/7:2-556 [007] .... 620.370856: cpu_frequency: state=1900000 cpu_id=7
kworker/7:2-556 [007] .... 620.390854: cpu_frequency: state=2100000 cpu_id=7
kworker/7:2-556 [007] .... 620.411853: cpu_frequency: state=2200000 cpu_id=7
kworker/7:2-556 [007] .... 620.432854: cpu_frequency: state=2400000 cpu_id=7
kworker/7:2-556 [007] .... 620.453854: cpu_frequency: state=2600000 cpu_id=7
kworker/7:2-556 [007] .... 620.494856: cpu_frequency: state=2900000 cpu_id=7
kworker/7:2-556 [007] .... 620.515856: cpu_frequency: state=3100000 cpu_id=7
kworker/7:2-556 [007] .... 620.536858: cpu_frequency: state=3300000 cpu_id=7
kworker/7:2-556 [007] .... 620.557857: cpu_frequency: state=3401000 cpu_id=7
<idle>-0 [007] d... 669.591363: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 669.591939: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] d... 669.591980: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] dN.. 669.591989: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 670.201224: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 670.221975: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 670.222016: cpu_frequency: state=3300000 cpu_id=7
<idle>-0 [007] d... 670.222026: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 670.234964: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 670.801251: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.236046: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 671.236073: cpu_frequency: state=3100000 cpu_id=7
<idle>-0 [007] d... 671.236112: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.393437: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 671.401277: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.404083: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 671.404111: cpu_frequency: state=2900000 cpu_id=7
<idle>-0 [007] d... 671.404125: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.404974: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 671.501180: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.995414: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 671.995459: cpu_frequency: state=2800000 cpu_id=7
<idle>-0 [007] d... 671.995469: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 671.996287: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 672.001305: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.078374: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 672.078410: cpu_frequency: state=2600000 cpu_id=7
<idle>-0 [007] d... 672.078419: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.158020: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 672.158040: cpu_frequency: state=2400000 cpu_id=7
<idle>-0 [007] d... 672.158044: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.160038: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 672.234557: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.237121: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 672.237174: cpu_frequency: state=2100000 cpu_id=7
<idle>-0 [007] d... 672.237186: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.237778: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 672.267902: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.269860: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 672.269906: cpu_frequency: state=1900000 cpu_id=7
<idle>-0 [007] d... 672.269914: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.271902: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 672.751342: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 672.823056: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556 [007] .... 672.823095: cpu_frequency: state=1600000 cpu_id=7

WITH

<idle>-0 [007] dN.. 4380.928009: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399 [007] .... 4380.949767: cpu_frequency: state=2000000 cpu_id=7
kworker/7:2-399 [007] .... 4380.969765: cpu_frequency: state=2200000 cpu_id=7
kworker/7:2-399 [007] .... 4381.009766: cpu_frequency: state=2500000 cpu_id=7
kworker/7:2-399 [007] .... 4381.029767: cpu_frequency: state=2600000 cpu_id=7
kworker/7:2-399 [007] .... 4381.049769: cpu_frequency: state=2800000 cpu_id=7
kworker/7:2-399 [007] .... 4381.069769: cpu_frequency: state=3000000 cpu_id=7
kworker/7:2-399 [007] .... 4381.089771: cpu_frequency: state=3100000 cpu_id=7
kworker/7:2-399 [007] .... 4381.109772: cpu_frequency: state=3400000 cpu_id=7
kworker/7:2-399 [007] .... 4381.129773: cpu_frequency: state=3401000 cpu_id=7
<idle>-0 [007] d... 4428.226159: cpu_idle: state=1 cpu_id=7
<idle>-0 [007] d... 4428.226176: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] d... 4428.226181: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4428.227177: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 4428.551640: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4428.649239: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399 [007] .... 4428.649268: cpu_frequency: state=2800000 cpu_id=7
<idle>-0 [007] d... 4428.649278: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4428.689856: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 4428.799542: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4428.801683: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399 [007] .... 4428.801748: cpu_frequency: state=1700000 cpu_id=7
<idle>-0 [007] d... 4428.801761: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4428.806545: cpu_idle: state=4294967295 cpu_id=7
...
<idle>-0 [007] d... 4429.051880: cpu_idle: state=4 cpu_id=7
<idle>-0 [007] d... 4429.086240: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399 [007] .... 4429.086293: cpu_frequency: state=1600000 cpu_id=7

Without the patch the CPU dropped to min frequency after 3.2s
With the patch applied the CPU dropped to min frequency after 0.86s

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 26f0dbc9 07-Nov-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Don't use 'timer' keyword

The earlier implementation of governors used background timers and so
functions, mutex, etc had 'timer' keyword in their names.

But that's not true anymore. Replace 'timer' with 'update', as those
functions, variables are based around updates to frequency.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9a15fb2c 18-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Drop the 'initialized' field from struct cpufreq_governor

The 'initialized' field in struct cpufreq_governor is only used by
the conservative governor (as a usage counter) and the way that
happens is far from straightforward and arguably incorrect.

Namely, the value of 'initialized' is checked by
cpufreq_dbs_governor_init() and cpufreq_dbs_governor_exit() and
the results of those checks are passed (as the second argument) to
the ->init() and ->exit() callbacks in struct dbs_governor. Those
callbacks are only implemented by the ondemand and conservative
governors and ondemand doesn't use their second argument at all.
In turn, the conservative governor uses it to decide whether or not
to either register or unregister a transition notifier.

That whole mechanism is not only unnecessarily convoluted, but also
racy, because the 'initialized' field of struct cpufreq_governor is
updated in cpufreq_init_governor() and cpufreq_exit_governor() under
policy->rwsem which doesn't help if one of these functions is run
twice in parallel for different policies (which isn't impossible in
principle), for example.

Instead of it, add a proper usage counter to the conservative
governor and update it from cs_init() and cs_exit() which is
guaranteed to be non-racy, as those functions are only called
under gov_dbs_data_mutex which is global.

With that in place, drop the 'initialized' field from struct
cpufreq_governor as it is not used any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# e788892b 02-Jun-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Get rid of governor events

The design of the cpufreq governor API is not very straightforward,
as struct cpufreq_governor provides only one callback to be invoked
from different code paths for different purposes. The purpose it is
invoked for is determined by its second "event" argument, causing it
to act as a "callback multiplexer" of sorts.

Unfortunately, that leads to extra complexity in governors, some of
which implement the ->governor() callback as a switch statement
that simply checks the event argument and invokes a separate function
to handle that specific event.

That extra complexity can be eliminated by replacing the all-purpose
->governor() callback with a family of callbacks to carry out specific
governor operations: initialization and exit, start and stop and policy
limits updates. That also turns out to reduce the code size too, so
do it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# b4f4b4b3 27-Apr-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Change confusing struct field and variable names

The name of the prev_cpu_wall field in struct cpu_dbs_info is
confusing, because it doesn't represent wall time, but the previous
update time as returned by get_cpu_idle_time() (that may be the
current value of jiffies_64 in some cases, for example).

Moreover, the names of some related variables in dbs_update() take
that confusion further.

Rename all of those things to make their names reflect the purpose
more accurately. While at it, drop unnecessary parens from one of
the updated expressions.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Chen Yu <yu.c.chen@intel.com>


# 379480d8 21-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Move governor symbols to cpufreq.h

Move definitions of symbols related to transition latency and
sampling rate to include/linux/cpufreq.h so they can be used by
(future) goverernors located outside of drivers/cpufreq/.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 66893b6a 21-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Move governor attribute set headers to cpufreq.h

Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) goverernors
located outside of drivers/cpufreq/.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 2d0c58ad 21-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Move abstract gov_attr_set code to seperate file

Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) goverernors that won't share
more code with "ondemand" and "conservative".

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 0dd3c1d6 21-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: New data type for management part of dbs_data

In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# e3f5ed93 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Make dbs_data_mutex static

That mutex is only used by cpufreq_governor_dbs() and it doesn't
need to be exported to modules, so make it static and drop the
export incantation.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 47ebaac1 18-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Relocate definitions of tuners structures

Move the definitions of struct od_dbs_tuners and struct cs_dbs_tuners
from the common governor header to the ondemand and conservative
governor code, respectively, as they don't need to be in the common
header any more.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 8c8f77fd 20-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Move per-CPU data to the common code

After previous changes there is only one piece of code in the
ondemand governor making references to per-CPU data structures,
but it can be easily modified to avoid doing that, so modify it
accordingly and move the definition of per-CPU data used by the
ondemand and conservative governors to the common code. Next,
change that code to access the per-CPU data structures directly
rather than via a governor callback.

This causes the ->get_cpu_cdbs governor callback to become
unnecessary, so drop it along with the macro and function
definitions related to it.

Finally, drop the definitions of struct od_cpu_dbs_info_s and
struct cs_cpu_dbs_info_s that aren't necessary any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 7d5a9956 18-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Make governor private data per-policy

Some fields in struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s
are only used for a limited set of CPUs. Namely, if a policy is
shared between multiple CPUs, those fields will only be used for one
of them (policy->cpu). This means that they really are per-policy
rather than per-CPU and holding room for them in per-CPU data
structures is generally wasteful. Also moving those fields into
per-policy data structures will allow some significant simplifications
to be made going forward.

For this reason, introduce struct cs_policy_dbs_info and
struct od_policy_dbs_info to hold those fields. Define each of the
new structures as an extension of struct policy_dbs_info (such that
struct policy_dbs_info is embedded in each of them) and introduce
new ->alloc and ->free governor callbacks to allocate and free
those structures, respectively, such that ->alloc() will return
a pointer to the struct policy_dbs_info embedded in the allocated
data structure and ->free() will take that pointer as its argument.

With that, modify the code accessing the data fields in question
in per-CPU data objects to look for them in the new structures
via the struct policy_dbs_info pointer available to it and drop
them from struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# a33cce1c 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix CPU load information updates via ->store

The ->store() callbacks of some tunable sysfs attributes of the
ondemand and conservative governors trigger immediate updates of
the CPU load information for all CPUs "governed" by the given
dbs_data by walking the cpu_dbs_info structures for all online
CPUs in the system and updating them.

This is questionable for two reasons. First, it may lead to a lot of
extra overhead on a system with many CPUs if the given dbs_data is
only associated with a few of them. Second, if governor tunables are
per-policy, the CPUs associated with the other sets of governor
tunables should not be updated.

To address this issue, use the observation that in all of the places
in question the update operation may be carried out in the same way
(because all of the tunables involved are now located in struct
dbs_data and readily available to the common code) and make the
code in those places invoke the same (new) helper function that
will carry out the update correctly.

That new function always checks the ignore_nice_load tunable value
and updates the CPUs' prev_cpu_nice data fields if that's set, which
wasn't done by the original code in store_io_is_busy(), but it
should have been done in there too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 76c5f66a 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: ondemand: Drop one more callback from struct od_ops

The ->powersave_bias_init_cpu callback in struct od_ops is only used
in one place and that invocation may be replaced with a direct call
to the function pointed to by that callback, so change the code
accordingly and drop the callback.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 8434dadb 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Drop unused governor callback and data fields

After some previous changes, the ->get_cpu_dbs_info_s governor
callback and the "governor" field in struct dbs_governor (whose
value represents the governor type) are not used any more, so
drop them.

Also drop the unused gov_ops field from struct dbs_governor.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 702c9e54 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Add a ->start callback for governors

To avoid having to check the governor type explicitly in the common
code in order to initialize data structures specific to the governor
type properly, add a ->start callback to struct dbs_governor and
use it to initialize those data structures for the ondemand and
conservative governors.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 8847e038 17-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Move io_is_busy to struct dbs_data

The io_is_busy governor tunable is only used by the ondemand governor
and is located in the ondemand-specific data structure, but it is
looked at by the common governor code that has to do ugly things to
get to that value, so move it to struct dbs_data and modify ondemand
accordingly.

Since the conservative governor never touches that field, it will
be always 0 for that governor and it won't have any effect on the
results of computations in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 8eb055d3 16-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: ondemand: Drop unused callback from struct od_ops

The ->freq_increase callback in struct od_ops is never invoked,
so drop it.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 07aa4402 14-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Use microseconds in sample delay computations

Do not convert microseconds to jiffies and the other way around
in governor computations related to the sampling rate and sample
delay and drop delay_for_sampling_rate() which isn't of any use
then.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 57dc3bcd 14-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Move rate_mult to struct policy_dbs

The rate_mult field in struct od_cpu_dbs_info_s is used by the code
shared with the conservative governor and to access it that code
has to do an ugly governor type check. However, first of all it
is ever only used for policy->cpu, so it is per-policy rather than
per-CPU and second, it is initialized to 1 by cpufreq_governor_start(),
so if the conservative governor never modifies it, it will have no
effect on the results of any computations.

For these reasons, move rate_mult to struct policy_dbs_info (as a
common field).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 4cccf755 14-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Get rid of the ->gov_check_cpu callback

The way the ->gov_check_cpu governor callback is used by the ondemand
and conservative governors is not really straightforward. Namely, the
governor calls dbs_check_cpu() that updates the load information for
the policy and the invokes ->gov_check_cpu() for the governor.

To get rid of that entanglement, notice that cpufreq_governor_limits()
doesn't need to call dbs_check_cpu() directly. Instead, it can simply
reset the sample delay to 0 which will cause a sample to be taken
immediately. The result of that is practically equivalent to calling
dbs_check_cpu() except that it will trigger a full update of governor
internal state and not just the ->gov_check_cpu() part.

Following that observation, make cpufreq_governor_limits() reset
the sample delay and turn dbs_check_cpu() into a function that will
simply evaluate the load and return the result called dbs_update().

That function can now be called by governors from the routines that
previously were pointed to by ->gov_check_cpu and those routines
can be called directly by each governor instead of dbs_check_cpu().
This way ->gov_check_cpu becomes unnecessary, so drop it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# e4db2813 14-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Avoid atomic operations in hot paths

Rework the handling of work items by dbs_update_util_handler() and
dbs_work_handler() so the former (which is executed in scheduler
paths) only uses atomic operations when absolutely necessary. That
is, when the policy is shared and dbs_update_util_handler() has
already decided that this is the time to queue up a work item.

In particular, this avoids the atomic ops entirely on platforms where
policy objects are never shared.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# aded387b 11-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: conservative: Update sample_delay_ns immediately

The ondemand governor already updates sample_delay_ns immediately on
updates to the sampling rate, but conservative doesn't do that.

It was left out earlier as the code was really too complex to get
that done easily. Things are sorted out very well now, however, and
the conservative governor can be modified to follow ondemand in that
respect.

Moreover, since the code needed to implement that in the
conservative governor would be identical to the corresponding
ondemand governor's code, make that code common and change both
governors to use it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 99522fe6 11-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Remove cpufreq_governor_lock

We used to drop policy->rwsem just before calling __cpufreq_governor()
in some cases earlier and so it was possible that __cpufreq_governor()
ran concurrently via separate threads for the same policy.

In order to guarantee valid state transitions for governors,
'governor_enabled' was required to be protected using some locking
and cpufreq_governor_lock was added for that.

But now __cpufreq_governor() is always called under policy->rwsem,
and 'governor_enabled' is protected against races even without
cpufreq_governor_lock.

Get rid of the extra lock now.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c54df071 09-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Create and traverse list of policy_dbs to avoid deadlock

The dbs_data_mutex lock is currently used in two places. First,
cpufreq_governor_dbs() uses it to guarantee mutual exclusion between
invocations of governor operations from the core. Second, it is used by
ondemand governor's update_sampling_rate() to ensure the stability of
data structures walked by it.

The second usage is quite problematic, because update_sampling_rate() is
called from a governor sysfs attribute's ->store callback and that leads
to a deadlock scenario involving cpufreq_governor_exit() which runs
under dbs_data_mutex. Thus it is better to rework the code so
update_sampling_rate() doesn't need to acquire dbs_data_mutex.

To that end, rework update_sampling_rate() to walk a list of policy_dbs
objects supported by the dbs_data one it has been called for (instead of
walking cpu_dbs_info object for all CPUs). The list manipulation is
protected with dbs_data->mutex which also is held around the execution
of update_sampling_rate(), it is not necessary to hold dbs_data_mutex in
that function any more.

Reported-by: Juri Lelli <juri.lelli@arm.com>
Reported-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fd8ddc48 08-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Drop unused macros for creating governor tunable attributes

The previous commit introduced a new set of macros for creating sysfs
attributes that represent governor tunables and the old macros used for
this purpose are not needed any more, so drop them.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c4435630 08-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: New sysfs show/store callbacks for governor tunables

The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation. That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time. Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them. Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog + rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ff4b1789 08-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Move common tunables to 'struct dbs_data'

There are a few common tunables shared between the ondemand and
conservative governors. Move them to struct dbs_data to simplify
code.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d0684d3b 08-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Create generic macro for common tunables

Some tunables are present in governor-specific structures, whereas one
(min_sampling_rate) is located directly in struct dbs_data.

There is a special macro for creating its sysfs attribute and the
show/store callbacks, but since more tunables are going to be moved
to struct dbs_data, a new generic macro for such cases will be useful,
so add it and use it for min_sampling_rate.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 686cc637 08-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Rename skip_work to work_count

The skip_work field in struct policy_dbs_info technically is a
counter, so give it a new name to reflect that.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# bc505475 07-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Rearrange governor data structures

The struct policy_dbs_info objects representing per-policy governor
data are not accessible directly from the corresponding policy
objects. To access them, one has to get a pointer to the
struct cpu_dbs_info of policy->cpu and use the policy_dbs field of
that which isn't really straightforward.

To address that rearrange the governor data structures so the
governor_data pointer in struct cpufreq_policy will point to
struct policy_dbs_info (instead of struct dbs_data) and that will
contain a pointer to struct dbs_data.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# d10b5eb5 06-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Drop cpu argument from dbs_check_cpu()

Since policy->cpu is always passed as the second argument to
dbs_check_cpu(), it is not really necessary to pass it, because
the function can obtain that value via its first argument just fine.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# e40e7b25 10-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Rename cpu_common_dbs_info to policy_dbs_info

The struct cpu_common_dbs_info structure represents the per-policy
part of the governor data (for the ondemand and conservative
governors), but its name doesn't reflect its purpose.

Rename it to struct policy_dbs_info and rename variables related to
it accordingly.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# ea59ee0d 07-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Drop the gov pointer from struct dbs_data

Since it is possible to obtain a pointer to struct dbs_governor
from a pointer to the struct governor embedded in it with the help
of container_of(), the additional gov pointer in struct dbs_data
isn't really necessary.

Drop that pointer and make the code using it reach the dbs_governor
object via policy->governor.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 906a6e5a 07-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Rework cpufreq_governor_dbs()

Since it is possible to obtain a pointer to struct dbs_governor
from a pointer to the struct governor embedded in it via
container_of(), the second argument of cpufreq_governor_init()
is not necessary. Accordingly, cpufreq_governor_dbs() doesn't
need its second argument either and the ->governor callbacks
for both the ondemand and conservative governors may be set
to cpufreq_governor_dbs() directly. Make that happen.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 7bdad34d 07-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Rename some data types and variables

The ondemand and conservative governors are represented by
struct common_dbs_data whose name doesn't reflect the purpose it
is used for, so rename it to struct dbs_governor and rename
variables of that type accordingly.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# af926185 04-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Put governor structure into common_dbs_data

For the ondemand and conservative governors (generally, governors
that use the common code in cpufreq_governor.c), there are two static
data structures representing the governor, the struct governor
structure (the interface to the cpufreq core) and the struct
common_dbs_data one (the interface to the cpufreq_governor.c code).

There's no fundamental reason why those two structures have to be
separate. Moreover, if the struct governor one is included into
struct common_dbs_data, it will be possible to reach the latter from
the policy via its policy->governor pointer, so it won't be necessary
to pass a separate pointer to it around. For this reason, embed
struct governor in struct common_dbs_data.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 2bb8d94f 07-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Use common mutex for dbs_data protection

Every governor relying on the common code in cpufreq_governor.c
has to provide its own mutex in struct common_dbs_data. However,
there actually is no need to have a separate mutex per governor
for this purpose, they may be using the same global mutex just
fine. Accordingly, introduce a single common mutex for that and
drop the mutex field from struct common_dbs_data.

That at least will ensure that the mutex is always present and
initialized regardless of what the particular governors do.

Another benefit is that the common code does not need a pointer to
a governor-related structure to get to the mutex which sometimes
helps.

Finally, it makes the code generally easier to follow.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 9be4fd2c 10-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Replace timers with utilization update callbacks

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


# 2dd3e724 08-Dec-2015 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Use lockless timer function

It is possible to get rid of the timer_lock spinlock used by the
governor timer function for synchronization, but a couple of races
need to be avoided.

The first race is between multiple dbs_timer_handler() instances
that may be running in parallel with each other on different
CPUs. Namely, one of them has to queue up the work item, but it
cannot be queued up more than once. To achieve that,
atomic_inc_return() can be used on the skip_work field of
struct cpu_common_dbs_info.

The second race is between an already running dbs_timer_handler()
and gov_cancel_work(). In that case the dbs_timer_handler() might
not notice the skip_work incrementation in gov_cancel_work() and
it might queue up its work item after gov_cancel_work() had
returned (and that work item would corrupt skip_work going
forward). To prevent that from happening, gov_cancel_work()
can be made wait for the timer function to complete (on all CPUs)
right after skip_work has been incremented.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 70f43e5e 08-Dec-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: replace per-CPU delayed work with timers

cpufreq governors evaluate load at sampling rate and based on that they
update frequency for a group of CPUs belonging to the same cpufreq
policy.

This is required to be done in a single thread for all policy->cpus, but
because we don't want to wakeup idle CPUs to do just that, we use
deferrable work for this. If we would have used a single delayed
deferrable work for the entire policy, there were chances that the CPU
required to run the handler can be in idle and we might end up not
changing the frequency for the entire group with load variations.

And so we were forced to keep per-cpu works, and only the one that
expires first need to do the real work and others are rescheduled for
next sampling time.

We have been using the more complex solution until now, where we used a
delayed deferrable work for this, which is a combination of a timer and
a work.

This could be made lightweight by keeping per-cpu deferred timers with a
single work item, which is scheduled by the first timer that expires.

This patch does just that and here are important changes:
- The timer handler will run in irq context and so we need to use a
spin_lock instead of the timer_mutex. And so a separate timer_lock is
created. This also makes the use of the mutex and lock quite clear, as
we know what exactly they are protecting.
- A new field 'skip_work' is added to track when the timer handlers can
queue a work. More comments present in code.

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# affde5d0 02-Dec-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Pass policy as argument to ->gov_dbs_timer()

Pass 'policy' as argument to ->gov_dbs_timer() instead of cdbs and
dbs_data.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 03d5eec0 07-Sep-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: conservative: remove 'enable' field

Conservative governor has its own 'enable' field to check if
conservative governor is used for a CPU or not

This can be checked by policy->governor with 'cpufreq_gov_conservative'
and so this field can be dropped.

Because its not guaranteed that dbs_info->cdbs.shared will a valid
pointer for all CPUs (will be NULL for CPUs that don't use
ondemand/conservative governors), we can't use it anymore. Lets get
policy with cpufreq_cpu_get_raw() instead.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 43e0ee36 18-Jul-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: split out common part of {cs|od}_dbs_timer()

Some part of cs_dbs_timer() and od_dbs_timer() is exactly same and is
unnecessarily duplicated.

Create the real work-handler in cpufreq_governor.c and put the common
code in this routine (dbs_timer()).

Shouldn't make any functional change.

Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 44152cb8 18-Jul-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Keep single copy of information common to policy->cpus

Some information is common to all CPUs belonging to a policy, but are
kept on per-cpu basis. Lets keep that in another structure common to all
policy->cpus. That will make updates/reads to that less complex and less
error prone.

The memory for cpu_common_dbs_info is allocated/freed at INIT/EXIT, so
that it we don't reallocate it for STOP/START sequence. It will be also
be used (in next patch) while the governor is stopped and so must not be
freed that early.

Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 42994af6 19-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: rename cur_policy as policy

Just call it 'policy', cur_policy is unnecessarily long and doesn't
have any special meaning.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 875b8508 19-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Rename 'cpu_dbs_common_info' to 'cpu_dbs_info'

Its not common info to all CPUs, but a structure representing common
type of cpu info to both governor types. Lets drop 'common_' from its
name.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d3574c85 19-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Drop unused field 'cpu'

Its not used at all, drop it.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 386d46e6 19-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Name delayed-work as dwork

Delayed work was named as 'work' and to access work within it we do
work.work. Not much readable. Rename delayed_work as 'dwork'.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 732b6d61 03-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Serialize governor callbacks

There are several races reported in cpufreq core around governors (only
ondemand and conservative) by different people.

There are at least two race scenarios present in governor code:
(a) Concurrent access/updates of governor internal structures.

It is possible that fields such as 'dbs_data->usage_count', etc. are
accessed simultaneously for different policies using same governor
structure (i.e. CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag unset). And
because of this we can dereference bad pointers.

For example consider a system with two CPUs with separate 'struct
cpufreq_policy' instances. CPU0 governor: ondemand and CPU1: powersave.
CPU0 switching to powersave and CPU1 to ondemand:
CPU0 CPU1

store* store*

cpufreq_governor_exit() cpufreq_governor_init()
dbs_data = cdata->gdbs_data;

if (!--dbs_data->usage_count)
kfree(dbs_data);

dbs_data->usage_count++;
*Bad pointer dereference*

There are other races possible between EXIT and START/STOP/LIMIT as
well. Its really complicated.

(b) Switching governor state in bad sequence:

For example trying to switch a governor to START state, when the
governor is in EXIT state. There are some checks present in
__cpufreq_governor() but they aren't sufficient as they compare events
against 'policy->governor_enabled', where as we need to take governor's
state into account, which can be used by multiple policies.

These two issues need to be solved separately and the responsibility
should be properly divided between cpufreq and governor core.

The first problem is more about the governor core, as it needs to
protect its structures properly. And the second problem should be fixed
in cpufreq core instead of governor, as its all about sequence of
events.

This patch is trying to solve only the first problem.

There are two types of data we need to protect,
- 'struct common_dbs_data': No matter what, there is going to be a
single copy of this per governor.
- 'struct dbs_data': With CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag set, we
will have per-policy copy of this data, otherwise a single copy.

Because of such complexities, the mutex present in 'struct dbs_data' is
insufficient to solve our problem. For example we need to protect
fetching of 'dbs_data' from different structures at the beginning of
cpufreq_governor_dbs(), to make sure it isn't currently being updated.

This can be fixed if we can guarantee serialization of event parsing
code for an individual governor. This is best solved with a mutex per
governor, and the placeholder for that is 'struct common_dbs_data'.

And so this patch moves the mutex from 'struct dbs_data' to 'struct
common_dbs_data' and takes it at the beginning and drops it at the end
of cpufreq_governor_dbs().

Tested with and without following configuration options:

CONFIG_LOCKDEP_SUPPORT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8e0484d2 03-Jun-2015 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: register notifier from cs_init()

Notifiers are required only for conservative governor and the common
governor code is unnecessarily polluted with that. Handle that from
cs_init/exit() instead of cpufreq_governor_dbs().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c8ae481b 09-Jun-2014 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'

'copy_prev_load' was recently added by commit: 18b46ab (cpufreq: governor: Be
friendly towards latency-sensitive bursty workloads).

It actually is a bit redundant as we also have 'prev_load' which can store any
integer value and can be used instead of 'copy_prev_load' by setting it zero.

True load can also turn out to be zero during long idle intervals (and hence the
actual value of 'prev_load' and the overloaded value can clash). However this is
not a problem because, if the true load was really zero in the previous
interval, it makes sense to evaluate the load afresh for the current interval
rather than copying the previous load.

So, drop 'copy_prev_load' and use 'prev_load' instead.

Update comments as well to make it more clear.

There is another change here which was probably missed by Srivatsa during the
last version of updates he made. The unlikely in the 'if' statement was covering
only half of the condition and the whole line should actually come under it.

Also checkpatch is made more silent as it was reporting this (--strict option):

CHECK: Alignment should match open parenthesis
+ if (unlikely(wall_time > (2 * sampling_rate) &&
+ j_cdbs->prev_load)) {

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 18b46abd 07-Jun-2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

cpufreq: governor: Be friendly towards latency-sensitive bursty workloads

Cpufreq governors like the ondemand governor calculate the load on the CPU
periodically by employing deferrable timers. A deferrable timer won't fire
if the CPU is completely idle (and there are no other timers to be run), in
order to avoid unnecessary wakeups and thus save CPU power.

However, the load calculation logic is agnostic to all this, and this can
lead to the problem described below.

Time (ms) CPU 1

100 Task-A running

110 Governor's timer fires, finds load as 100% in the last
10ms interval and increases the CPU frequency.

110.5 Task-A running

120 Governor's timer fires, finds load as 100% in the last
10ms interval and increases the CPU frequency.

125 Task-A went to sleep. With nothing else to do, CPU 1
went completely idle.

200 Task-A woke up and started running again.

200.5 Governor's deferred timer (which was originally programmed
to fire at time 130) fires now. It calculates load for the
time period 120 to 200.5, and finds the load is almost zero.
Hence it decreases the CPU frequency to the minimum.

210 Governor's timer fires, finds load as 100% in the last
10ms interval and increases the CPU frequency.

So, after the workload woke up and started running, the frequency was suddenly
dropped to absolute minimum, and after that, there was an unnecessary delay of
10ms (sampling period) to increase the CPU frequency back to a reasonable value.
And this pattern repeats for every wake-up-from-cpu-idle for that workload.
This can be quite undesirable for latency- or response-time sensitive bursty
workloads. So we need to fix the governor's logic to detect such wake-up-from-
cpu-idle scenarios and start the workload at a reasonably high CPU frequency.

One extreme solution would be to fake a load of 100% in such scenarios. But
that might lead to undesirable side-effects such as frequency spikes (which
might also need voltage changes) especially if the previous frequency happened
to be very low.

We just want to avoid the stupidity of dropping down the frequency to a minimum
and then enduring a needless (and long) delay before ramping it up back again.
So, let us simply carry forward the previous load - that is, let us just pretend
that the 'load' for the current time-window is the same as the load for the
previous window. That way, the frequency and voltage will continue to be set
to whatever values they were set at previously. This means that bursty workloads
will get a chance to influence the CPU frequency at which they wake up from
cpu-idle, based on their past execution history. Thus, they might be able to
avoid suffering from slow wakeups and long response-times.

However, we should take care not to over-do this. For example, such a "copy
previous load" logic will benefit cases like this: (where # represents busy
and . represents idle)

##########.........#########.........###########...........##########........

but it will be detrimental in cases like the one shown below, because it will
retain the high frequency (copied from the previous interval) even in a mostly
idle system:

##########.........#.................#.....................#...............

(i.e., the workload finished and the remaining tasks are such that their busy
periods are smaller than the sampling interval, which causes the timer to
always get deferred. So, this will make the copy-previous-load logic copy
the initial high load to subsequent idle periods over and over again, thus
keeping the frequency high unnecessarily).

So, we modify this copy-previous-load logic such that it is used only once
upon every wakeup-from-idle. Thus if we have 2 consecutive idle periods, the
previous load won't get blindly copied over; cpufreq will freshly evaluate the
load in the second idle interval, thus ensuring that the system comes back to
its normal state.

[ The right way to solve this whole problem is to teach the CPU frequency
governors to also track load on a per-task basis, not just a per-CPU basis,
and then use both the data sources intelligently to set the appropriate
frequency on the CPUs. But that involves redesigning the cpufreq subsystem,
so this patch should make the situation bearable until then. ]

Experimental results:
+-------------------+

I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps in
between its execution such that its total utilization can be a user-defined
value, say 10% or 20% (higher the utilization specified, lesser the amount of
sleeps injected). This ebizzy was run with a single-thread, tied to CPU 8.

Behavior observed with tracing (sample taken from 40% utilization runs):
------------------------------------------------------------------------

Without patch:
~~~~~~~~~~~~~~
kworker/8:2-12137 416.335742: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137 416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40753 416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137 416.345744: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137 416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40753 416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
<snip> --------------------------------------------------------------------- <snip>
<...>-40753 416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
<idle>-0 416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
<...>-40753 416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137 416.505739: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137 416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40753 416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137 416.515742: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137 416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy

Observation: Ebizzy went idle at 416.402202, and started running again at
416.502130. But cpufreq noticed the long idle period, and dropped the frequency
at 416.505739, only to increase it back again at 416.515742, realizing that the
workload is in-fact CPU bound. Thus ebizzy needlessly ran at the lowest frequency
for almost 13 milliseconds (almost 1 full sample period), and this pattern
repeats on every sleep-wakeup. This could hurt latency-sensitive workloads quite
a lot.

With patch:
~~~~~~~~~~~

kworker/8:2-29802 464.832535: cpu_frequency: state=2061000 cpu_id=8
<snip> --------------------------------------------------------------------- <snip>
kworker/8:2-29802 464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40738 464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802 464.972536: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-29802 464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40738 464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
<snip> --------------------------------------------------------------------- <snip>
kworker/8:2-29802 465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40738 465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802 465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40738 465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
<idle>-0 465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
<...>-40738 465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802 465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
<...>-40738 465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2

Observation: Ebizzy went idle at 465.035797, and started running again at
465.240178. Since ebizzy was the only real workload running on this CPU,
cpufreq retained the frequency at 4.1Ghz throughout the run of ebizzy, no
matter how many times ebizzy slept and woke-up in-between. Thus, ebizzy
got the 10ms worth of 4.1 Ghz benefit during every sleep-wakeup (as compared
to the run without the patch) and this boost gave a modest improvement in total
throughput, as shown below.

Sleeping-ebizzy records-per-second:
-----------------------------------

Utilization Without patch With patch Difference (Absolute and % values)
10% 274767 277046 + 2279 (+0.829%)
20% 543429 553484 + 10055 (+1.850%)
40% 1090744 1107959 + 17215 (+1.578%)
60% 1634908 1662018 + 27110 (+1.658%)

A rudimentary and somewhat approximately latency-sensitive workload such as
sleeping-ebizzy itself showed a consistent, noticeable performance improvement
with this patch. Hence, workloads that are truly latency-sensitive will benefit
quite a bit from this change. Moreover, this is an overall win-win since this
patch does not hurt power-savings at all (because, this patch does not reduce
the idle time or idle residency; and the high frequency of the CPU when it goes
to cpu-idle does not affect/hurt the power-savings of deep idle states).

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6f1e4efd 03-Jan-2014 Jane Li <jiel@marvell.com>

cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled

When a CPU is hot removed we'll cancel all the delayed work items via
gov_cancel_work(). Sometimes the delayed work function determines that
it should adjust the delay for all other CPUs that the policy is
managing. If this scenario occurs, the canceling CPU will cancel its own
work but queue up the other CPUs works to run.

Commit 3617f2 (cpufreq: Fix timer/workqueue corruption due to double
queueing) has tried to fix this, but reading governor_enabled is not
protected by cpufreq_governor_lock. Even though od_dbs_timer() checks
governor_enabled before gov_queue_work(), this scenario may occur. For
example:

CPU0 CPU1
---- ----
cpu_down()
... <work runs>
__cpufreq_remove_dev() od_dbs_timer()
__cpufreq_governor() policy->governor_enabled
policy->governor_enabled = false;
cpufreq_governor_dbs()
case CPUFREQ_GOV_STOP:
gov_cancel_work(dbs_data, policy);
cpu0 work is canceled
timer is canceled
cpu1 work is canceled
<waits for cpu1>
gov_queue_work(*, *, true);
cpu0 work queued
cpu1 work queued
cpu2 work queued
...
cpu1 work is canceled
cpu2 work is canceled
...

At the end of the GOV_STOP case cpu0 still has a work queued to
run although the code is expecting all of the works to be
canceled. __cpufreq_remove_dev() will then proceed to
re-initialize all the other CPUs works except for the CPU that is
going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
will trample over the queued work and debugobjects will spit out
a warning:

WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x14
Modules linked in:
CPU: 1 PID: 1205 Comm: sh Tainted: G W 3.10.0 #200
[<c01144f0>] (unwind_backtrace+0x0/0xf8) from [<c0111d98>] (show_stack+0x10/0x14)
[<c0111d98>] (show_stack+0x10/0x14) from [<c01272cc>] (warn_slowpath_common+0x4c/0x68)
[<c01272cc>] (warn_slowpath_common+0x4c/0x68) from [<c012737c>] (warn_slowpath_fmt+0x30/0x40)
[<c012737c>] (warn_slowpath_fmt+0x30/0x40) from [<c034c640>] (debug_print_object+0x94/0xbc)
[<c034c640>] (debug_print_object+0x94/0xbc) from [<c034c7f8>] (__debug_object_init+0xc8/0x3c0)
[<c034c7f8>] (__debug_object_init+0xc8/0x3c0) from [<c01360e0>] (init_timer_key+0x20/0x104)
[<c01360e0>] (init_timer_key+0x20/0x104) from [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c)
[<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c) from [<c04833a8>] (__cpufreq_governor+0x80/0x1b0)
[<c04833a8>] (__cpufreq_governor+0x80/0x1b0) from [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380)
[<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380) from [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c)
[<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c) from [<c014fb40>] (notifier_call_chain+0x44/0x84)
[<c014fb40>] (notifier_call_chain+0x44/0x84) from [<c012ae44>] (__cpu_notify+0x2c/0x48)
[<c012ae44>] (__cpu_notify+0x2c/0x48) from [<c068dd40>] (_cpu_down+0x80/0x258)
[<c068dd40>] (_cpu_down+0x80/0x258) from [<c068df40>] (cpu_down+0x28/0x3c)
[<c068df40>] (cpu_down+0x28/0x3c) from [<c068e4c0>] (store_online+0x30/0x74)
[<c068e4c0>] (store_online+0x30/0x74) from [<c03a7308>] (dev_attr_store+0x18/0x24)
[<c03a7308>] (dev_attr_store+0x18/0x24) from [<c0256fe0>] (sysfs_write_file+0x100/0x180)
[<c0256fe0>] (sysfs_write_file+0x100/0x180) from [<c01fec9c>] (vfs_write+0xbc/0x184)
[<c01fec9c>] (vfs_write+0xbc/0x184) from [<c01ff034>] (SyS_write+0x40/0x68)
[<c01ff034>] (SyS_write+0x40/0x68) from [<c010e200>] (ret_fast_syscall+0x0/0x48)

In gov_queue_work(), lock cpufreq_governor_lock before gov_queue_work,
and unlock it after __gov_queue_work(). In this way, governor_enabled
is guaranteed not changed in gov_queue_work().

Signed-off-by: Jane Li <jiel@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0b981e70 02-Oct-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: use cpufreq_driver->flags to mark CPUFREQ_HAVE_GOVERNOR_PER_POLICY

Use cpufreq_driver->flags to mark CPUFREQ_HAVE_GOVERNOR_PER_POLICY instead
of a separate field within cpufreq_driver. This will save some bytes of
memory.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c4afc410 26-Aug-2013 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: governor: Fix typos in comments

- 'Governer' should be 'Governor'.
- 'S' is used for Siemens (electrical conductance) in SI units,
so use small 's' for seconds.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3a3e9e06 06-Aug-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Give consistent names to cpufreq_policy objects

They are called policy, cur_policy, new_policy, data, etc. Just call
them policy wherever possible.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5ff0a268 06-Aug-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Clean up header files included in the core

This patch addresses the following issues in the header files in the
cpufreq core:
- Include headers in ascending order, so that we don't add same
many times by mistake.
- <asm/> must be included after <linux/>, so that they override
whatever they need to.
- Remove unnecessary includes.
- Don't include files already included by cpufreq.h or
cpufreq_governor.h.

[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6c4640c3 04-Aug-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: rename ignore_nice as ignore_nice_load

This sysfs file was called ignore_nice_load earlier and commit
4d5dcc4 (cpufreq: governor: Implement per policy instances of
governors) changed its name to ignore_nice by mistake.

Lets get it renamed back to its original name.

Reported-by: Martin von Gagern <Martin.vGagern@gmx.net>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.10+ <stable@vger.kernel.org> # 3.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# dfa5bb62 05-Jun-2013 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: ondemand: Change the calculation of target frequency

The ondemand governor calculates load in terms of frequency and
increases it only if load_freq is greater than up_threshold
multiplied by the current or average frequency. This appears to
produce oscillations of frequency between min and max because,
for example, a relatively small load can easily saturate minimum
frequency and lead the CPU to the max. Then, it will decrease
back to the min due to small load_freq.

Change the calculation method of load and target frequency on the
basis of the following two observations:

- Load computation should not depend on the current or average
measured frequency. For example, absolute load of 80% at 100MHz
is not necessarily equivalent to 8% at 1000MHz in the next
sampling interval.

- It should be possible to increase the target frequency to any
value present in the frequency table proportional to the absolute
load, rather than to the max only, so that:

Target frequency = C * load

where we take C = policy->cpuinfo.max_freq / 100.

Tested on Intel i7-3770 CPU @ 3.40GHz and on Quad core 1500MHz Krait.
Phoronix benchmark of Linux Kernel Compilation 3.1 test shows an
increase ~1.5% in performance. cpufreq_stats (time_in_state) shows
that middle frequencies are used more, with this patch. Highest
and lowest frequencies were used less by ~9%.

[rjw: We have run multiple other tests on kernels with this
change applied and in the vast majority of cases it turns out
that the resulting performance improvement also leads to reduced
consumption of energy. The change is additionally justified by
the overall simplification of the code in question.]

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bb176f7d 19-Jun-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Fix minor formatting issues

There were a few noticeable formatting issues in core cpufreq code.
This cleans them up to make code look better. The changes include:
- Whitespace cleanup.
- Rearrangements of code.
- Multiline comments fixes.
- Formatting changes to fit 80 columns.

Copyright information in cpufreq.c is also updated to include my name
for 2013.

[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 72a4ce34 17-May-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Move get_cpu_idle_time() to cpufreq.c

Governors other than ondemand and conservative can also use
get_cpu_idle_time() and they aren't required to compile
cpufreq_governor.c. So, move these independent routines to
cpufreq.c instead.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a97c98ad 30-Apr-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governors: Fix CPUFREQ_GOV_POLICY_{INIT|EXIT} notifiers

There are two types of INIT/EXIT activities that we need to do for
governors:
- Done only once per governor (doesn't depend how many instances of
the governor there are). eg: cpufreq_register_notifier() for
conservative governor.
- Done per governor instance, eg: sysfs_{create|remove}_group().

There were some corner cases where current code isn't able to handle
them separately and so failing for some test cases.

We use two separate variables now for keeping track of above two
requirements.
- governor->initialized for first one
- dbs_data->usage_count for per governor instance

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fb30809e 02-Apr-2013 Jacob Shin <jacob.shin@amd.com>

cpufreq: ondemand: allow custom powersave_bias_target handler to be registered

This allows for another [arch specific] driver to hook into existing
powersave bias function of the ondemand governor. i.e. This allows AMD
specific powersave bias function (in a separate AMD specific driver)
to aid ondemand governor's frequency transition decisions.

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Acked-by: Thomas Renninger <trenn@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# beb0ff39 01-Apr-2013 Borislav Petkov <bp@suse.de>

cpufreq: Correct header guards typo

It should be "governor".

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9366d840 28-Feb-2013 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: governors: Calculate iowait time only when necessary

Currently we always calculate the CPU iowait time and add it to idle time.
If we are in ondemand and we use io_is_busy, we re-calculate iowait time
and we subtract it from idle time.

With this patch iowait time is calculated only when necessary avoiding
the double call to get_cpu_iowait_time_us. We use a parameter in
function get_cpu_idle_time to distinguish when the iowait time will be
added to idle time or not, without the need of keeping the prev_io_wait.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.,org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 031299b3 26-Feb-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governors: Avoid unnecessary per cpu timer interrupts

Following patch has introduced per cpu timers or works for ondemand and
conservative governors.

commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe
Author: Rickard Andersson <rickard.andersson@stericsson.com>
Date: Thu Dec 27 14:55:38 2012 +0000

cpufreq: handle SW coordinated CPUs

This causes additional unnecessary interrupts on all cpus when the load is
recently evaluated by any other cpu. i.e. When load is recently evaluated by cpu
x, we don't really need any other cpu to evaluate this load again for the next
sampling_rate time.

Some sort of code is present to avoid that but we are still getting timer
interrupts for all cpus. A good way of avoiding this would be to modify delays
for all cpus (policy->cpus) whenever any cpu has evaluated load.

This patch does this change and some related code cleanup.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 98104ee2 26-Feb-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Set MIN_LATENCY_MULTIPLIER to 20

Currently MIN_LATENCY_MULTIPLIER is set defined as 100 and so on a system with
transition latency of 1 ms, the minimum sampling time comes to be around 100 ms.
That is quite big if you want to get better performance for your system.

Redefine MIN_LATENCY_MULTIPLIER to 20 so that we can support 20ms sampling rate
for such platforms.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4d5dcc42 27-Mar-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governor: Implement per policy instances of governors

Currently, there can't be multiple instances of single governor_type.
If we have a multi-package system, where we have multiple instances
of struct policy (per package), we can't have multiple instances of
same governor. i.e. We can't have multiple instances of ondemand
governor for multiple packages.

Governors directory in sysfs is created at /sys/devices/system/cpu/cpufreq/
governor-name/. Which again reflects that there can be only one
instance of a governor_type in the system.

This is a bottleneck for multicluster system, where we want different
packages to use same governor type, but with different tunables.

This patch uses the infrastructure provided by earlier patch and
implements init/exit routines for ondemand and conservative
governors.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e5dde92c 27-Feb-2013 Namhyung Kim <namhyung.kim@lge.com>

cpufreq: Fix a typo in comment

Fix a typo in a comment in cpufreq_governor.h.

[rjw: Changelog]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4bd4e428 06-Feb-2013 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: ondemand: Replace down_differential tuner with adj_up_threshold

In order to avoid the calculation of up_threshold - down_differential
every time that the frequency must be decreased, we replace the
down_differential tuner with the adj_up_threshold which keeps the
difference across multiple checks.

Update the adj_up_threshold only when the up_theshold is also updated.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4447266b 31-Jan-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governors: Remove code redundancy between governors

With the inclusion of following patches:

9f4eb10 cpufreq: conservative: call dbs_check_cpu only when necessary
772b4b1 cpufreq: ondemand: call dbs_check_cpu only when necessary

code redundancy between the conservative and ondemand governors is
introduced again, so get rid of it.

[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8eeed095 31-Jan-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governors: Get rid of dbs_data->enable field

CPUFREQ_GOV_START/STOP are called only once for all policy->cpus and hence we
don't need to adapt cpufreq_governor_dbs() routine for multiple calls.

So, this patch removes dbs_data->enable field entirely. And rearrange code a
bit.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2624f90c 31-Jan-2013 Fabio Baltieri <fabio.baltieri@linaro.org>

cpufreq: governors: implement generic policy_is_shared

Implement a generic helper function policy_is_shared() to replace the
current dbs_sw_coordinated_cpus() at cpufreq level, so that it can be
used by code other than cpufreq governors.

Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# da53d61e 27-Dec-2012 Fabio Baltieri <fabio.baltieri@linaro.org>

cpufreq: ondemand: call dbs_check_cpu only when necessary

Modify ondemand timer to not resample CPU utilization if recently
sampled from another SW coordinated core.

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2abfa876 27-Dec-2012 Rickard Andersson <rickard.andersson@stericsson.com>

cpufreq: handle SW coordinated CPUs

This patch fixes a bug that occurred when we had load on a secondary CPU
and the primary CPU was sleeping. Only one sampling timer was spawned
and it was spawned as a deferred timer on the primary CPU, so when a
secondary CPU had a change in load this was not detected by the cpufreq
governor (both ondemand and conservative).

This patch make sure that deferred timers are run on all CPUs in the
case of software controlled CPUs that run on the same frequency.

Signed-off-by: Rickard Andersson <rickard.andersson@stericsson.com>
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1e7586a1 25-Oct-2012 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Fix sparse warnings by updating cputime64_t to u64

There were few sparse warnings due to mismatch of type on function arguments.
Two types were used u64 and cputime64_t. Both are actually u64, so use u64 only.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4471a34f 25-Oct-2012 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: governors: remove redundant code

Initially ondemand governor was written and then using its code conservative
governor is written. It used a lot of code from ondemand governor, but copy of
code was created instead of using the same routines from both governors. Which
increased code redundancy, which is difficult to manage.

This patch is an attempt to move common part of both the governors to
cpufreq_governor.c file to come over above mentioned issues.

This shouldn't change anything from functionality point of view.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>